* [PATCH 0/3] Introduce user namespace capabilities
From: Jonathan Calmels @ 2024-05-16 9:22 UTC
  To: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
      Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen
  Cc: containers, Jonathan Calmels, linux-kernel, linux-fsdevel,
      linux-security-module, keyrings

It's that time of the year again where we debate security settings for user
namespaces ;)

I’ve been experimenting with different approaches to address the gripe
around user namespaces being used as attack vectors.
After invaluable feedback from Serge and Christian offline, this is what I
came up with.

There are obviously a lot of things we could do differently but I feel this
is the right balance between functionality, simplicity and security. This
also serves as a good foundation and could always be extended if the need
arises in the future.

Notes:

- Adding a new capability set is far from ideal, but trying to reuse the
  existing capability framework was deemed both impractical and
  questionable security-wise, so here we are.

- We might want to add new capabilities for some of the checks instead of
  reusing CAP_SETPCAP every time. Serge mentioned something like
  CAP_SYS_LIMIT?

- In the last patch, we could decide to have stronger requirements and
  perform checks inside cap_capable() in case we want to retroactively
  prevent capabilities in old namespaces, this might be an overreach though
  so I left it out.

  I'm also not fond of the ulong logic for setting the sysctl parameter, on
  the other hand, the usermodehelper code always uses two u32s which makes
  it very confusing to set in userspace.
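For readers unfamiliar with the "two u32s" remark in the last note, here is a minimal sketch of how a 64-bit capability mask maps onto two 32-bit words. It assumes the low word comes first and the high word second, which is how the existing usermodehelper bset and inheritable sysctls are understood to behave. The helper names cap_mask_split() and cap_mask_join() are hypothetical and not part of the series.

```c
#include <stdint.h>

/*
 * Hypothetical helper: split a 64-bit capability mask into the two
 * 32-bit words a two-value sysctl would carry.
 * Assumption: the low word is written first, the high word second.
 */
static void cap_mask_split(uint64_t mask, uint32_t *lo, uint32_t *hi)
{
	*lo = (uint32_t)(mask & UINT64_C(0xffffffff));
	*hi = (uint32_t)(mask >> 32);
}

/* Hypothetical helper: rebuild the 64-bit mask from the two words. */
static uint64_t cap_mask_join(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}
```

For a full 41-bit capability set with CAP_SYS_RAWIO (bit 17) cleared, 0x000001fffffdffff, the split gives 4294836223 for the low word and 511 for the high word, which illustrates why a pair of decimal words is awkward to compose by hand.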
Jonathan Calmels (3):
  capabilities: user namespace capabilities
  capabilities: add securebit for strict userns caps
  capabilities: add cap userns sysctl mask

 fs/proc/array.c                 |  9 ++++
 include/linux/cred.h            |  3 ++
 include/linux/securebits.h      |  1 +
 include/linux/user_namespace.h  |  7 +++
 include/uapi/linux/prctl.h      |  7 +++
 include/uapi/linux/securebits.h | 11 ++++-
 kernel/cred.c                   |  3 ++
 kernel/sysctl.c                 | 10 ++++
 kernel/umh.c                    | 16 +++++++
 kernel/user_namespace.c         | 83 ++++++++++++++++++++++++++++++---
 security/commoncap.c            | 59 +++++++++++++++++++++++
 security/keys/process_keys.c    |  3 ++
 12 files changed, 204 insertions(+), 8 deletions(-)

--
2.45.0

* [PATCH 1/3] capabilities: user namespace capabilities
From: Jonathan Calmels @ 2024-05-16 9:22 UTC
  To: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
      Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen
  Cc: containers, Jonathan Calmels, linux-kernel, linux-fsdevel,
      linux-security-module, keyrings

Attackers often rely on user namespaces to get elevated (yet confined)
privileges in order to target specific subsystems (e.g. [1]). Distributions
have been pretty adamant that they need a way to configure these, most of
them carry out-of-tree patches to do so, or plainly refuse to enable them.
As a result, there have been multiple efforts over the years to introduce
various knobs to control and/or disable user namespaces (e.g. [2][3][4]).

While we acknowledge that there are already ways to control the creation of
such namespaces (the most recent being a LSM hook), there are inherent
issues with these approaches. Preventing the user namespace creation is not
fine-grained enough, and in some cases, incompatible with various userspace
expectations (e.g. container runtimes, browser sandboxing, service
isolation)

This patch addresses these limitations by introducing an additional
capability set used to restrict the permissions granted when creating user
namespaces. This way, processes can apply the principle of least privilege
by configuring only the capabilities they need for their namespaces.

For compatibility reasons, processes always start with a full userns
capability set.

On namespace creation, the userns capability set (pU) is assigned to the
new effective (pE), permitted (pP) and bounding set (X) of the task:

  pU = pE = pP = X

The userns capability set obeys the invariant that no bit can ever be set
if it is not already part of the task’s bounding set. This ensures that no
namespace can ever gain more privileges than its predecessors.
Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
in the userns set requires its corresponding bit to be set in the permitted
set. This effectively mimics the inheritable set rules and means that, by
default, only root in the initial user namespace can gain userns
capabilities:

  p’U = (pE & CAP_SETPCAP) ? X : (X & pP)

Note that since userns capabilities are strictly hierarchical, policies can
be enforced at various levels (e.g. init, pam_cap) and inherited by every
child namespace.

Here is a sample program that can be used to verify the functionality:

/*
 * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
 *
 * ./cap_userns_test unshare -r grep Cap /proc/self/status
 * CapInh: 0000000000000000
 * CapPrm: 000001fffffdffff
 * CapEff: 000001fffffdffff
 * CapBnd: 000001fffffdffff
 * CapAmb: 0000000000000000
 * CapUNs: 000001fffffdffff
 */

int main(int argc, char *argv[])
{
	if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
		err(1, "cannot drop userns cap");

	execvp(argv[1], argv + 1);
	err(1, "cannot exec");
}

Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org
Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com
Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht

Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net>
---
 fs/proc/array.c              |  9 ++++++
 include/linux/cred.h         |  3 ++
 include/uapi/linux/prctl.h   |  7 +++++
 kernel/cred.c                |  3 ++
 kernel/umh.c                 | 16 ++++++++++
 kernel/user_namespace.c      | 12 +++-----
 security/commoncap.c         | 59 ++++++++++++++++++++++++++++++++++++
 security/keys/process_keys.c |  3 ++
 8 files changed, 105 insertions(+), 7 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index 34a47fb0c57f..364e8bb19f9d 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
 	const struct cred *cred;
 	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
 			cap_bset, cap_ambient;
+#ifdef CONFIG_USER_NS
+	kernel_cap_t cap_userns;
+#endif

 	rcu_read_lock();
 	cred = __task_cred(p);
@@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
 	cap_effective	= cred->cap_effective;
 	cap_bset	= cred->cap_bset;
 	cap_ambient	= cred->cap_ambient;
+#ifdef CONFIG_USER_NS
+	cap_userns	= cred->cap_userns;
+#endif
 	rcu_read_unlock();

 	render_cap_t(m, "CapInh:\t", &cap_inheritable);
@@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
 	render_cap_t(m, "CapEff:\t", &cap_effective);
 	render_cap_t(m, "CapBnd:\t", &cap_bset);
 	render_cap_t(m, "CapAmb:\t", &cap_ambient);
+#ifdef CONFIG_USER_NS
+	render_cap_t(m, "CapUNs:\t", &cap_userns);
+#endif
 }

 static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
diff --git a/include/linux/cred.h b/include/linux/cred.h
index 2976f534a7a3..adab0031443e 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -124,6 +124,9 @@ struct cred {
 	kernel_cap_t	cap_effective;	/* caps we can actually use */
 	kernel_cap_t	cap_bset;	/* capability bounding set */
 	kernel_cap_t	cap_ambient;	/* Ambient capability set */
+#ifdef CONFIG_USER_NS
+	kernel_cap_t	cap_userns;	/* User namespace capability set */
+#endif
 #ifdef CONFIG_KEYS
 	unsigned char	jit_keyring;	/* default keyring to attach requested
 					 * keys to */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 370ed14b1ae0..e09475171f62 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -198,6 +198,13 @@ struct prctl_mm_map {
 # define PR_CAP_AMBIENT_LOWER		3
 # define PR_CAP_AMBIENT_CLEAR_ALL	4

+/* Control the userns capability set */
+#define PR_CAP_USERNS			48
+# define PR_CAP_USERNS_IS_SET		1
+# define PR_CAP_USERNS_RAISE		2
+# define PR_CAP_USERNS_LOWER		3
+# define PR_CAP_USERNS_CLEAR_ALL	4
+
 /* arm64 Scalable Vector Extension controls */
 /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
 #define PR_SVE_SET_VL			50	/* set task vector length */
diff --git a/kernel/cred.c b/kernel/cred.c
index 075cfa7c896f..9912c6f3bc6b 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -56,6 +56,9 @@ struct cred init_cred = {
 	.cap_permitted		= CAP_FULL_SET,
 	.cap_effective		= CAP_FULL_SET,
 	.cap_bset		= CAP_FULL_SET,
+#ifdef CONFIG_USER_NS
+	.cap_userns		= CAP_FULL_SET,
+#endif
 	.user			= INIT_USER,
 	.user_ns		= &init_user_ns,
 	.group_info		= &init_groups,
diff --git a/kernel/umh.c b/kernel/umh.c
index 1b13c5d34624..51f1e1d25d49 100644
--- a/kernel/umh.c
+++ b/kernel/umh.c
@@ -32,6 +32,9 @@

 #include <trace/events/module.h>

+#ifdef CONFIG_USER_NS
+static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
+#endif
 static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
 static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
 static DEFINE_SPINLOCK(umh_sysctl_lock);
@@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data)
 	new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
 	new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
 					     new->cap_inheritable);
+#ifdef CONFIG_USER_NS
+	new->cap_userns = cap_intersect(usermodehelper_userns,
+					new->cap_userns);
+#endif
 	spin_unlock(&umh_sysctl_lock);

 	if (sub_info->init) {
@@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = {
 		.mode		= 0600,
 		.proc_handler	= proc_cap_handler,
 	},
+#ifdef CONFIG_USER_NS
+	{
+		.procname	= "userns",
+		.data		= &usermodehelper_userns,
+		.maxlen		= 2 * sizeof(unsigned long),
+		.mode		= 0600,
+		.proc_handler	= proc_cap_handler,
+	},
+#endif
 	{ }
 };

diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index 0b0b95418b16..7e624607330b 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)

 static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
 {
-	/* Start with the same capabilities as init but useless for doing
-	 * anything as the capabilities are bound to the new user namespace.
-	 */
-	cred->securebits = SECUREBITS_DEFAULT;
+	/* Start with the capabilities defined in the userns set. */
+	cred->cap_bset = cred->cap_userns;
+	cred->cap_permitted = cred->cap_userns;
+	cred->cap_effective = cred->cap_userns;
 	cred->cap_inheritable = CAP_EMPTY_SET;
-	cred->cap_permitted = CAP_FULL_SET;
-	cred->cap_effective = CAP_FULL_SET;
 	cred->cap_ambient = CAP_EMPTY_SET;
-	cred->cap_bset = CAP_FULL_SET;
+	cred->securebits = SECUREBITS_DEFAULT;
 #ifdef CONFIG_KEYS
 	key_put(cred->request_key_auth);
 	cred->request_key_auth = NULL;
diff --git a/security/commoncap.c b/security/commoncap.c
index 162d96b3a676..b3d3372bf910 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void)
 	return 1;
 }

+/*
+ * Determine whether a userns capability can be raised.
+ * Returns 1 if it can, 0 otherwise.
+ */
+#ifdef CONFIG_USER_NS
+static inline int cap_uns_is_raiseable(unsigned long cap)
+{
+	if (!!cap_raised(current_cred()->cap_userns, cap))
+		return 1;
+	/* a capability cannot be raised unless the current task has it in
+	 * its bounding set and, without CAP_SETPCAP, its permitted set.
+	 */
+	if (!cap_raised(current_cred()->cap_bset, cap))
+		return 0;
+	if (cap_capable(current_cred(), current_cred()->user_ns,
+			CAP_SETPCAP, CAP_OPT_NONE) != 0 &&
+	    !cap_raised(current_cred()->cap_permitted, cap))
+		return 0;
+	return 1;
+}
+#endif
+
 /**
  * cap_capset - Validate and apply proposed changes to current's capabilities
  * @new: The proposed new credentials; alterations should be made here
@@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
 			return commit_creds(new);
 		}

+#ifdef CONFIG_USER_NS
+	case PR_CAP_USERNS:
+		if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
+			if (arg3 | arg4 | arg5)
+				return -EINVAL;
+
+			new = prepare_creds();
+			if (!new)
+				return -ENOMEM;
+			cap_clear(new->cap_userns);
+			return commit_creds(new);
+		}
+
+		if (((!cap_valid(arg3)) | arg4 | arg5))
+			return -EINVAL;
+
+		if (arg2 == PR_CAP_USERNS_IS_SET) {
+			return !!cap_raised(current_cred()->cap_userns, arg3);
+		} else if (arg2 != PR_CAP_USERNS_RAISE &&
+			   arg2 != PR_CAP_USERNS_LOWER) {
+			return -EINVAL;
+		} else {
+			if (arg2 == PR_CAP_USERNS_RAISE &&
+			    !cap_uns_is_raiseable(arg3))
+				return -EPERM;
+
+			new = prepare_creds();
+			if (!new)
+				return -ENOMEM;
+			if (arg2 == PR_CAP_USERNS_RAISE)
+				cap_raise(new->cap_userns, arg3);
+			else
+				cap_lower(new->cap_userns, arg3);
+			return commit_creds(new);
+		}
+#endif
+
 	default:
 		/* No functionality available - continue with default */
 		return -ENOSYS;
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index b5d5333ab330..e3670d815435 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork)
 	new->cap_effective	= old->cap_effective;
 	new->cap_ambient	= old->cap_ambient;
 	new->cap_bset		= old->cap_bset;
+#ifdef CONFIG_USER_NS
+	new->cap_userns		= old->cap_userns;
+#endif

 	new->jit_keyring	= old->jit_keyring;
 	new->thread_keyring	= key_get(old->thread_keyring);
--
2.45.0

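The set rules stated in the commit message above can be modeled in userspace as plain bitmask operations. This is only a sketch under stated assumptions, not the kernel implementation: struct cap_model and the three helpers are hypothetical names, the CAP_SETPCAP privilege check is simplified to a bit test on the effective set (the patch itself calls cap_capable() against the task's user namespace), and the bit numbers 8 and 17 are the values of CAP_SETPCAP and CAP_SYS_RAWIO from linux/capability.h.

```c
#include <stdint.h>

/* Userspace model only: one bit per capability, like kernel_cap_t. */
struct cap_model {
	uint64_t effective;	/* pE */
	uint64_t permitted;	/* pP */
	uint64_t bounding;	/* X  */
	uint64_t userns;	/* pU */
};

#define MODEL_CAP_SETPCAP	8	/* CAP_SETPCAP in linux/capability.h */
#define MODEL_CAP_SYS_RAWIO	17	/* CAP_SYS_RAWIO in linux/capability.h */
#define MODEL_CAP_BIT(n)	(UINT64_C(1) << (n))

/*
 * Bits that may be raised in pU:
 *   p'U = (pE & CAP_SETPCAP) ? X : (X & pP)
 * Simplification: the privilege check is a bit test on pE.
 */
static uint64_t userns_raisable_mask(const struct cap_model *c)
{
	if (c->effective & MODEL_CAP_BIT(MODEL_CAP_SETPCAP))
		return c->bounding;
	return c->bounding & c->permitted;
}

/* Model of PR_CAP_USERNS_LOWER: dropping a bit is always allowed. */
static void userns_lower(struct cap_model *c, unsigned int cap)
{
	c->userns &= ~MODEL_CAP_BIT(cap);
}

/* Model of namespace creation: pE = pP = X = pU. */
static void enter_new_userns(struct cap_model *c)
{
	c->bounding = c->userns;
	c->permitted = c->userns;
	c->effective = c->userns;
}
```

Starting from a full 41-bit set, lowering CAP_SYS_RAWIO and then entering a new namespace leaves every set at 0x000001fffffdffff, which matches the CapPrm, CapEff, CapBnd and CapUNs values in the sample program's expected output.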
* Re: [PATCH 1/3] capabilities: user namespace capabilities
From: Jarkko Sakkinen @ 2024-05-16 12:27 UTC
  To: Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook,
      Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells
  Cc: containers, linux-kernel, linux-fsdevel, linux-security-module,
      keyrings

Some quick remarks, no bandwidth to understand this.

First of all short summary does not define any action so it should be
rather e.g.

"capabilities: Add userns capabilities"

Much more understandable.

On Thu May 16, 2024 at 12:22 PM EEST, Jonathan Calmels wrote:
> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable them.
> As a result, there have been multiple efforts over the years to introduce
> various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
>
> While we acknowledge that there are already ways to control the creation of
> such namespaces (the most recent being a LSM hook), there are inherent
> issues with these approaches. Preventing the user namespace creation is not
> fine-grained enough, and in some cases, incompatible with various userspace
> expectations (e.g. container runtimes, browser sandboxing, service
> isolation)
>
> This patch addresses these limitations by introducing an additional
> capability set used to restrict the permissions granted when creating user
> namespaces. This way, processes can apply the principle of least privilege
> by configuring only the capabilities they need for their namespaces.
>
> For compatibility reasons, processes always start with a full userns
> capability set.
>
> On namespace creation, the userns capability set (pU) is assigned to the
> new effective (pE), permitted (pP) and bounding set (X) of the task:
>
>   pU = pE = pP = X
>
> The userns capability set obeys the invariant that no bit can ever be set
> if it is not already part of the task’s bounding set. This ensures that no
> namespace can ever gain more privileges than its predecessors.
> Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
> in the userns set requires its corresponding bit to be set in the permitted
> set. This effectively mimics the inheritable set rules and means that, by
> default, only root in the initial user namespace can gain userns
> capabilities:
>
>   p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
>
> Note that since userns capabilities are strictly hierarchical, policies can
> be enforced at various levels (e.g. init, pam_cap) and inherited by every
> child namespace.
>
> Here is a sample program that can be used to verify the functionality:
>
> /*
>  * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
>  *
>  * ./cap_userns_test unshare -r grep Cap /proc/self/status
>  * CapInh: 0000000000000000
>  * CapPrm: 000001fffffdffff
>  * CapEff: 000001fffffdffff
>  * CapBnd: 000001fffffdffff
>  * CapAmb: 0000000000000000
>  * CapUNs: 000001fffffdffff
>  */
>
> int main(int argc, char *argv[])
> {
> 	if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
> 		err(1, "cannot drop userns cap");
>
> 	execvp(argv[1], argv + 1);
> 	err(1, "cannot exec");
> }
>
> Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
> Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org
> Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com
> Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht
>
> Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net>
> ---
>  fs/proc/array.c              |  9 ++++++
>  include/linux/cred.h         |  3 ++
>  include/uapi/linux/prctl.h   |  7 +++++
>  kernel/cred.c                |  3 ++
>  kernel/umh.c                 | 16 ++++++++++
>  kernel/user_namespace.c      | 12 +++-----
>  security/commoncap.c         | 59 ++++++++++++++++++++++++++++++++++++
>  security/keys/process_keys.c |  3 ++
>  8 files changed, 105 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 34a47fb0c57f..364e8bb19f9d 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	const struct cred *cred;
>  	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
>  			cap_bset, cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t cap_userns;
> +#endif
>
>  	rcu_read_lock();
>  	cred = __task_cred(p);
> @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	cap_effective	= cred->cap_effective;
>  	cap_bset	= cred->cap_bset;
>  	cap_ambient	= cred->cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	cap_userns	= cred->cap_userns;
> +#endif
>  	rcu_read_unlock();
>
>  	render_cap_t(m, "CapInh:\t", &cap_inheritable);
> @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	render_cap_t(m, "CapEff:\t", &cap_effective);
>  	render_cap_t(m, "CapBnd:\t", &cap_bset);
>  	render_cap_t(m, "CapAmb:\t", &cap_ambient);
> +#ifdef CONFIG_USER_NS
> +	render_cap_t(m, "CapUNs:\t", &cap_userns);
> +#endif
>  }
>
>  static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 2976f534a7a3..adab0031443e 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -124,6 +124,9 @@ struct cred {
>  	kernel_cap_t	cap_effective;	/* caps we can actually use */
>  	kernel_cap_t	cap_bset;	/* capability bounding set */
>  	kernel_cap_t	cap_ambient;	/* Ambient capability set */
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t	cap_userns;	/* User namespace capability set */
> +#endif
>  #ifdef CONFIG_KEYS
>  	unsigned char	jit_keyring;	/* default keyring to attach requested
>  					 * keys to */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 370ed14b1ae0..e09475171f62 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -198,6 +198,13 @@ struct prctl_mm_map {
>  # define PR_CAP_AMBIENT_LOWER		3
>  # define PR_CAP_AMBIENT_CLEAR_ALL	4
>
> +/* Control the userns capability set */
> +#define PR_CAP_USERNS			48
> +# define PR_CAP_USERNS_IS_SET		1
> +# define PR_CAP_USERNS_RAISE		2
> +# define PR_CAP_USERNS_LOWER		3
> +# define PR_CAP_USERNS_CLEAR_ALL	4

Kernel coding style does not support this way but instead recommends
enum but apparently the whole header is not following that so I guess
it is fine ;-)

> +
>  /* arm64 Scalable Vector Extension controls */
>  /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
>  #define PR_SVE_SET_VL			50	/* set task vector length */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 075cfa7c896f..9912c6f3bc6b 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -56,6 +56,9 @@ struct cred init_cred = {
>  	.cap_permitted		= CAP_FULL_SET,
>  	.cap_effective		= CAP_FULL_SET,
>  	.cap_bset		= CAP_FULL_SET,
> +#ifdef CONFIG_USER_NS
> +	.cap_userns		= CAP_FULL_SET,
> +#endif
>  	.user			= INIT_USER,
>  	.user_ns		= &init_user_ns,
>  	.group_info		= &init_groups,
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 1b13c5d34624..51f1e1d25d49 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -32,6 +32,9 @@
>
>  #include <trace/events/module.h>
>
> +#ifdef CONFIG_USER_NS
> +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
> +#endif
>  static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
>  static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
>  static DEFINE_SPINLOCK(umh_sysctl_lock);
> @@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data)
>  	new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
>  	new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
>  					     new->cap_inheritable);
> +#ifdef CONFIG_USER_NS
> +	new->cap_userns = cap_intersect(usermodehelper_userns,
> +					new->cap_userns);

Could be also a single line (checkpatch.pl does not complain).

> +#endif
>  	spin_unlock(&umh_sysctl_lock);
>
>  	if (sub_info->init) {
> @@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = {
>  		.mode		= 0600,
>  		.proc_handler	= proc_cap_handler,
>  	},
> +#ifdef CONFIG_USER_NS
> +	{
> +		.procname	= "userns",
> +		.data		= &usermodehelper_userns,
> +		.maxlen		= 2 * sizeof(unsigned long),
> +		.mode		= 0600,
> +		.proc_handler	= proc_cap_handler,
> +	},
> +#endif
>  	{ }
>  };
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 0b0b95418b16..7e624607330b 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)
>
>  static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  {
> -	/* Start with the same capabilities as init but useless for doing
> -	 * anything as the capabilities are bound to the new user namespace.
> -	 */
> -	cred->securebits = SECUREBITS_DEFAULT;
> +	/* Start with the capabilities defined in the userns set. */
> +	cred->cap_bset = cred->cap_userns;
> +	cred->cap_permitted = cred->cap_userns;
> +	cred->cap_effective = cred->cap_userns;
>  	cred->cap_inheritable = CAP_EMPTY_SET;
> -	cred->cap_permitted = CAP_FULL_SET;
> -	cred->cap_effective = CAP_FULL_SET;
>  	cred->cap_ambient = CAP_EMPTY_SET;
> -	cred->cap_bset = CAP_FULL_SET;
> +	cred->securebits = SECUREBITS_DEFAULT;
>  #ifdef CONFIG_KEYS
>  	key_put(cred->request_key_auth);
>  	cred->request_key_auth = NULL;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 162d96b3a676..b3d3372bf910 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void)
>  	return 1;
>  }
>
> +/*
> + * Determine whether a userns capability can be raised.
> + * Returns 1 if it can, 0 otherwise.
> + */
> +#ifdef CONFIG_USER_NS
> +static inline int cap_uns_is_raiseable(unsigned long cap)
> +{
> +	if (!!cap_raised(current_cred()->cap_userns, cap))
> +		return 1;

Empty line.

> +	/* a capability cannot be raised unless the current task has it in

Incorrectly formatted comment:

/*
 * Foo

That is what's the correct format.

> +	 * its bounding set and, without CAP_SETPCAP, its permitted set.
> +	 */
> +	if (!cap_raised(current_cred()->cap_bset, cap))
> +		return 0;

Empty line might be appropriate here.

> +	if (cap_capable(current_cred(), current_cred()->user_ns,
> +			CAP_SETPCAP, CAP_OPT_NONE) != 0 &&
> +	    !cap_raised(current_cred()->cap_permitted, cap))

I'd consider being dummy here to make this more easy to verify also in
the future: create two bools and use them for final comparison. My head
hurts reading that.

> +		return 0;
> +	return 1;
> +}
> +#endif
> +
>  /**
>   * cap_capset - Validate and apply proposed changes to current's capabilities
>   * @new: The proposed new credentials; alterations should be made here
> @@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  			return commit_creds(new);
>  		}
>
> +#ifdef CONFIG_USER_NS
> +	case PR_CAP_USERNS:
> +		if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
> +			if (arg3 | arg4 | arg5)
> +				return -EINVAL;
> +
> +			new = prepare_creds();
> +			if (!new)
> +				return -ENOMEM;
> +			cap_clear(new->cap_userns);
> +			return commit_creds(new);
> +		}
> +
> +		if (((!cap_valid(arg3)) | arg4 | arg5))
> +			return -EINVAL;
> +
> +		if (arg2 == PR_CAP_USERNS_IS_SET) {
> +			return !!cap_raised(current_cred()->cap_userns, arg3);
> +		} else if (arg2 != PR_CAP_USERNS_RAISE &&
> +			   arg2 != PR_CAP_USERNS_LOWER) {
> +			return -EINVAL;
> +		} else {
> +			if (arg2 == PR_CAP_USERNS_RAISE &&
> +			    !cap_uns_is_raiseable(arg3))
> +				return -EPERM;
> +
> +			new = prepare_creds();
> +			if (!new)
> +				return -ENOMEM;
> +			if (arg2 == PR_CAP_USERNS_RAISE)
> +				cap_raise(new->cap_userns, arg3);
> +			else
> +				cap_lower(new->cap_userns, arg3);
> +			return commit_creds(new);
> +		}
> +#endif
> +
>  	default:
>  		/* No functionality available - continue with default */
>  		return -ENOSYS;
> diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
> index b5d5333ab330..e3670d815435 100644
> --- a/security/keys/process_keys.c
> +++ b/security/keys/process_keys.c
> @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork)
>  	new->cap_effective	= old->cap_effective;
>  	new->cap_ambient	= old->cap_ambient;
>  	new->cap_bset		= old->cap_bset;
> +#ifdef CONFIG_USER_NS
> +	new->cap_userns		= old->cap_userns;
> +#endif
>
>  	new->jit_keyring	= old->jit_keyring;
>  	new->thread_keyring	= key_get(old->thread_keyring);

BR, Jarkko

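One possible reading of the "two bools" suggestion in the review above, written as a userspace model so it can be compiled and checked on its own. This is a sketch, not a proposed kernel change: struct cred_model, model_cap_raised() and cap_uns_is_raiseable_model() are hypothetical names, has_setpcap stands in for a successful cap_capable(..., CAP_SETPCAP, CAP_OPT_NONE) call, and capability numbers are assumed to be below 64.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the fields the check reads. */
struct cred_model {
	uint64_t cap_userns;
	uint64_t cap_bset;
	uint64_t cap_permitted;
	bool has_setpcap;	/* models cap_capable(..., CAP_SETPCAP, ...) == 0 */
};

/* Model of cap_raised(): true if bit 'cap' is set (cap must be < 64). */
static bool model_cap_raised(uint64_t set, unsigned int cap)
{
	return (set >> cap) & 1;
}

/*
 * Same decision as cap_uns_is_raiseable() in the patch, with the final
 * condition spelled out through two named booleans.
 */
static bool cap_uns_is_raiseable_model(const struct cred_model *cred,
				       unsigned int cap)
{
	bool in_bset, in_permitted;

	/* Raising a bit that is already set is always allowed. */
	if (model_cap_raised(cred->cap_userns, cap))
		return true;

	in_bset = model_cap_raised(cred->cap_bset, cap);
	in_permitted = model_cap_raised(cred->cap_permitted, cap);

	/*
	 * A capability cannot be raised unless the task has it in its
	 * bounding set and, without CAP_SETPCAP, in its permitted set.
	 */
	return in_bset && (cred->has_setpcap || in_permitted);
}
```

The single return expression is equivalent to the two early returns in the patch: the first rejects when the bit is missing from the bounding set, the second rejects when the task lacks CAP_SETPCAP and the bit is missing from the permitted set.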
* Re: [PATCH 1/3] capabilities: user namespace capabilities
From: John Johansen @ 2024-05-16 22:07 UTC
  To: Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook,
      Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells,
      Jarkko Sakkinen
  Cc: containers, linux-kernel, linux-fsdevel, linux-security-module,
      keyrings

On 5/16/24 02:22, Jonathan Calmels wrote:
> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable them.
> As a result, there have been multiple efforts over the years to introduce
> various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
>
> While we acknowledge that there are already ways to control the creation of
> such namespaces (the most recent being a LSM hook), there are inherent
> issues with these approaches. Preventing the user namespace creation is not
> fine-grained enough, and in some cases, incompatible with various userspace

agreed, though it really is application dependent. Some applications handle
the denial at userns creation better, than the capability after. Others
like anything based on QTWebEngine will crash on denial of userns creation
but handle denial of the capability within the userns just fine, and some
applications just crash regardless.

The userns cred from the LSM hook can be modified, yes it is currently
specified as const but is still under construction so it can be safely
modified the LSM hook just needs a small update.

The advantage of doing it under the LSM is an LSM can have a richer policy
around what can use them and tracking of what is allowed. That is to say the
LSM has the capability of being finer grained than doing it via
capabilities.

I am not opposed to adding another mechanism to control user namespaces,
I am just not currently convinced that capabilities are the right
mechanism.

> expectations (e.g. container runtimes, browser sandboxing, service
> isolation)
>
> This patch addresses these limitations by introducing an additional
> capability set used to restrict the permissions granted when creating user
> namespaces. This way, processes can apply the principle of least privilege
> by configuring only the capabilities they need for their namespaces.
>
> For compatibility reasons, processes always start with a full userns
> capability set.
>
> On namespace creation, the userns capability set (pU) is assigned to the
> new effective (pE), permitted (pP) and bounding set (X) of the task:
>
>   pU = pE = pP = X
>

this should be bounded by the creating task's bounding set, otherwise
the capability model's bounding invariant will be broken, but having the
capabilities that the userns want to access in the task's bounding set is
a problem for all the unprivileged processes wanting access to user
namespaces.

Simply setting the userns fcap on the programs that want access to user
namespaces, does certainly reduce the attack surface, but really is
insufficient for utilities like unshare, bwrap, lxd etc. They can be used
to trivially by-pass the restriction.

> The userns capability set obeys the invariant that no bit can ever be set
> if it is not already part of the task’s bounding set. This ensures that no
> namespace can ever gain more privileges than its predecessors.
> Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
> in the userns set requires its corresponding bit to be set in the permitted
> set. This effectively mimics the inheritable set rules and means that, by
> default, only root in the initial user namespace can gain userns
> capabilities:
>
>   p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
>

If I am reading this right for unprivileged processes the capabilities in
the userns are bounded by the processes permitted set before the userns is
created?

This is only being respected in PR_CTL, the user mode helper is straight
setting the caps.

> Note that since userns capabilities are strictly hierarchical, policies can
> be enforced at various levels (e.g. init, pam_cap) and inherited by every
> child namespace.
>
> Here is a sample program that can be used to verify the functionality:
>
> /*
>  * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
>  *
>  * ./cap_userns_test unshare -r grep Cap /proc/self/status
>  * CapInh: 0000000000000000
>  * CapPrm: 000001fffffdffff
>  * CapEff: 000001fffffdffff
>  * CapBnd: 000001fffffdffff
>  * CapAmb: 0000000000000000
>  * CapUNs: 000001fffffdffff
>  */
>
> int main(int argc, char *argv[])
> {
> 	if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
> 		err(1, "cannot drop userns cap");
>
> 	execvp(argv[1], argv + 1);
> 	err(1, "cannot exec");
> }
>
> Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
> Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org
> Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com
> Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht
>
> Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net>
> ---
>  fs/proc/array.c              |  9 ++++++
>  include/linux/cred.h         |  3 ++
>  include/uapi/linux/prctl.h   |  7 +++++
>  kernel/cred.c                |  3 ++
>  kernel/umh.c                 | 16 ++++++++++
>  kernel/user_namespace.c      | 12 +++-----
>  security/commoncap.c         | 59 ++++++++++++++++++++++++++++++++++++
>  security/keys/process_keys.c |  3 ++
>  8 files changed, 105 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 34a47fb0c57f..364e8bb19f9d 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	const struct cred *cred;
>  	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
>  			cap_bset, cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t cap_userns;
> +#endif
>
>  	rcu_read_lock();
>  	cred = __task_cred(p);
> @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	cap_effective	= cred->cap_effective;
>  	cap_bset	= cred->cap_bset;
>  	cap_ambient	= cred->cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	cap_userns	= cred->cap_userns;
> +#endif
>  	rcu_read_unlock();
>
>  	render_cap_t(m, "CapInh:\t", &cap_inheritable);
> @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	render_cap_t(m, "CapEff:\t", &cap_effective);
>  	render_cap_t(m, "CapBnd:\t", &cap_bset);
>  	render_cap_t(m, "CapAmb:\t", &cap_ambient);
> +#ifdef CONFIG_USER_NS
> +	render_cap_t(m, "CapUNs:\t", &cap_userns);
> +#endif
>  }
>
>  static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 2976f534a7a3..adab0031443e 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -124,6 +124,9 @@ struct cred {
>  	kernel_cap_t	cap_effective;	/* caps we can actually use */
>  	kernel_cap_t	cap_bset;	/* capability bounding set */
>  	kernel_cap_t	cap_ambient;	/* Ambient capability set */
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t	cap_userns;	/* User namespace capability set */
> +#endif
>  #ifdef CONFIG_KEYS
>  	unsigned char	jit_keyring;	/* default keyring to attach requested
>  					 * keys to */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 370ed14b1ae0..e09475171f62 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -198,6 +198,13 @@ struct prctl_mm_map {
>  # define PR_CAP_AMBIENT_LOWER		3
>  # define PR_CAP_AMBIENT_CLEAR_ALL	4
>
> +/* Control the userns capability set */
> +#define PR_CAP_USERNS			48
> +# define PR_CAP_USERNS_IS_SET		1
> +# define PR_CAP_USERNS_RAISE		2
> +# define PR_CAP_USERNS_LOWER		3
> +# define PR_CAP_USERNS_CLEAR_ALL	4
> +
>  /* arm64 Scalable Vector Extension controls */
>  /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
>  #define PR_SVE_SET_VL			50	/* set task vector length */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 075cfa7c896f..9912c6f3bc6b 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -56,6 +56,9 @@ struct cred init_cred = {
>  	.cap_permitted		= CAP_FULL_SET,
>  	.cap_effective		= CAP_FULL_SET,
>  	.cap_bset		= CAP_FULL_SET,
> +#ifdef CONFIG_USER_NS
> +	.cap_userns		= CAP_FULL_SET,
> +#endif
>  	.user			= INIT_USER,
>  	.user_ns		= &init_user_ns,
>  	.group_info		= &init_groups,
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 1b13c5d34624..51f1e1d25d49 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -32,6 +32,9 @@
>
>  #include <trace/events/module.h>
>
> +#ifdef CONFIG_USER_NS
> +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
> +#endif
>  static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
>  static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
>  static DEFINE_SPINLOCK(umh_sysctl_lock);
> @@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data)
>  	new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
>  	new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
>  					     new->cap_inheritable);
> +#ifdef CONFIG_USER_NS
> +	new->cap_userns = cap_intersect(usermodehelper_userns,
> +					new->cap_userns);
> +#endif
>  	spin_unlock(&umh_sysctl_lock);
>
>  	if (sub_info->init) {
> @@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = {
>  		.mode		= 0600,
>  		.proc_handler	= proc_cap_handler,
>  	},
> +#ifdef CONFIG_USER_NS
> +	{
> +		.procname	= "userns",
> +		.data		= &usermodehelper_userns,
> +		.maxlen		= 2 * sizeof(unsigned
long), > + .mode = 0600, > + .proc_handler = proc_cap_handler, > + }, > +#endif > { } > }; > > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 0b0b95418b16..7e624607330b 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts) > > static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) > { > - /* Start with the same capabilities as init but useless for doing > - * anything as the capabilities are bound to the new user namespace. > - */ > - cred->securebits = SECUREBITS_DEFAULT; > + /* Start with the capabilities defined in the userns set. */ > + cred->cap_bset = cred->cap_userns; > + cred->cap_permitted = cred->cap_userns; > + cred->cap_effective = cred->cap_userns; > cred->cap_inheritable = CAP_EMPTY_SET; > - cred->cap_permitted = CAP_FULL_SET; > - cred->cap_effective = CAP_FULL_SET; > cred->cap_ambient = CAP_EMPTY_SET; > - cred->cap_bset = CAP_FULL_SET; > + cred->securebits = SECUREBITS_DEFAULT; > #ifdef CONFIG_KEYS > key_put(cred->request_key_auth); > cred->request_key_auth = NULL; > diff --git a/security/commoncap.c b/security/commoncap.c > index 162d96b3a676..b3d3372bf910 100644 > --- a/security/commoncap.c > +++ b/security/commoncap.c > @@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void) > return 1; > } > > +/* > + * Determine whether a userns capability can be raised. > + * Returns 1 if it can, 0 otherwise. > + */ > +#ifdef CONFIG_USER_NS > +static inline int cap_uns_is_raiseable(unsigned long cap) > +{ > + if (!!cap_raised(current_cred()->cap_userns, cap)) > + return 1; > + /* a capability cannot be raised unless the current task has it in > + * its bounding set and, without CAP_SETPCAP, its permitted set. 
> + */ > + if (!cap_raised(current_cred()->cap_bset, cap)) > + return 0; > + if (cap_capable(current_cred(), current_cred()->user_ns, > + CAP_SETPCAP, CAP_OPT_NONE) != 0 && > + !cap_raised(current_cred()->cap_permitted, cap)) > + return 0; > + return 1; > +} > +#endif > + > /** > * cap_capset - Validate and apply proposed changes to current's capabilities > * @new: The proposed new credentials; alterations should be made here > @@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, > return commit_creds(new); > } > > +#ifdef CONFIG_USER_NS > + case PR_CAP_USERNS: > + if (arg2 == PR_CAP_USERNS_CLEAR_ALL) { > + if (arg3 | arg4 | arg5) > + return -EINVAL; > + > + new = prepare_creds(); > + if (!new) > + return -ENOMEM; > + cap_clear(new->cap_userns); > + return commit_creds(new); > + } > + > + if (((!cap_valid(arg3)) | arg4 | arg5)) > + return -EINVAL; > + > + if (arg2 == PR_CAP_USERNS_IS_SET) { > + return !!cap_raised(current_cred()->cap_userns, arg3); > + } else if (arg2 != PR_CAP_USERNS_RAISE && > + arg2 != PR_CAP_USERNS_LOWER) { > + return -EINVAL; > + } else { > + if (arg2 == PR_CAP_USERNS_RAISE && > + !cap_uns_is_raiseable(arg3)) > + return -EPERM; > + > + new = prepare_creds(); > + if (!new) > + return -ENOMEM; > + if (arg2 == PR_CAP_USERNS_RAISE) > + cap_raise(new->cap_userns, arg3); > + else > + cap_lower(new->cap_userns, arg3); > + return commit_creds(new); > + } > +#endif > + > default: > /* No functionality available - continue with default */ > return -ENOSYS; > diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c > index b5d5333ab330..e3670d815435 100644 > --- a/security/keys/process_keys.c > +++ b/security/keys/process_keys.c > @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork) > new->cap_effective = old->cap_effective; > new->cap_ambient = old->cap_ambient; > new->cap_bset = old->cap_bset; > +#ifdef CONFIG_USER_NS > + new->cap_userns = old->cap_userns; > +#endif > 
> new->jit_keyring = old->jit_keyring; > new->thread_keyring = key_get(old->thread_keyring); ^ permalink raw reply [flat|nested] 53+ messages in thread
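For readers trying to follow the raise rule discussed above, here is a rough userspace model of the check in cap_uns_is_raiseable() from the quoted diff. It is only a sketch: plain 64-bit masks stand in for kernel_cap_t, the names cap_model and model_uns_is_raiseable are made up for illustration, and the CAP_SETPCAP test is simplified to a bit test on the effective set instead of the cap_capable() call the patch uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers as defined in include/uapi/linux/capability.h. */
#define CAP_SETPCAP	8
#define CAP_NET_ADMIN	12
#define CAP_SYS_RAWIO	17

/* Hypothetical stand-in for the capability fields of struct cred. */
struct cap_model {
	uint64_t permitted;
	uint64_t effective;
	uint64_t bset;
	uint64_t userns;
};

static inline bool has_cap(uint64_t set, unsigned int cap)
{
	return (set >> cap) & 1;
}

/*
 * Model of the raise rule: a capability that is already in the userns
 * set can always be "raised"; otherwise it must be in the bounding set
 * and, when CAP_SETPCAP is not effective, in the permitted set too.
 */
static bool model_uns_is_raiseable(const struct cap_model *c, unsigned int cap)
{
	if (has_cap(c->userns, cap))
		return true;
	if (!has_cap(c->bset, cap))
		return false;
	if (!has_cap(c->effective, CAP_SETPCAP) && !has_cap(c->permitted, cap))
		return false;
	return true;
}
```

Under this model an unprivileged task (empty permitted and effective sets) that has lost CAP_SYS_RAWIO from its userns set cannot get it back, while a task with CAP_SETPCAP effective can, as long as the capability is still in its bounding set.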
* Re: [PATCH 1/3] capabilities: user namespace capabilities
From: Jonathan Calmels @ 2024-05-17 10:51 UTC
To: John Johansen
Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
	Serge Hallyn, Paul Moore, James Morris, David Howells,
	Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel,
	linux-security-module, keyrings

On Thu, May 16, 2024 at 03:07:28PM GMT, John Johansen wrote:
> agreed, though it really is application dependent. Some applications handle
> the denial at userns creation better, than the capability after. Others
> like anything based on QTWebEngine will crash on denial of userns creation
> but handle denial of the capability within the userns just fine, and some
> applications just crash regardless.

Yes this is application specific, but I would argue that the latter is
much preferable. For example, having one application crash in a
container is probably ok, but not being able to start the container in
the first place is probably not. Similarly, preventing the network
namespace creation breaks services which rely on systemd’s
PrivateNetwork, even though they most likely use it to prevent any
networking from being done.

> The userns cred from the LSM hook can be modified, yes it is currently
> specified as const but is still under construction so it can be safely
> modified the LSM hook just needs a small update.
>
> The advantage of doing it under the LSM is an LSM can have a richer policy
> around what can use them and tracking of what is allowed. That is to say the
> LSM has the capability of being finer grained than doing it via capabilities.

Sure, we could modify the LSM hook to do all sorts of things, but
leveraging it would be quite cumbersome, will take time to show up in
userspace, or simply never be adopted.
We’re already seeing it in Ubuntu which started requiring AppArmor profiles.

This new capability set would be a universal thing that could be
leveraged today without modification to userspace. Moreover, it’s a
simple framework that can be extended.
As you mentioned, LSMs are even finer grained, and that’s the idea,
those could be used hand in hand eventually. You could envision LSM
hooks controlling the userns capability set, and thus enforce policies
on the creation of nested namespaces without limiting the other tasks’
capabilities.

> I am not opposed to adding another mechanism to control user namespaces,
> I am just not currently convinced that capabilities are the right
> mechanism.

Well that’s the thing, from past conversations, there is a lot of
disagreement about restricting namespaces. By restricting the
capabilities granted by namespaces instead, we’re actually treating the
root cause of most concerns.

Today user namespaces are "special" and always grant full caps. Adding a
new capability set to limit this behavior is logical; same way it's done
for usual process transitions.
Essentially this set is to namespaces what the inheritable set is to
root.

> this should be bounded by the creating task's bounding set, otherwise
> the capability model's bounding invariant will be broken, but having the
> capabilities that the userns wants to access in the task's bounding set is
> a problem for all the unprivileged processes wanting access to user
> namespaces.

This is possible with the security bit introduced in the second patch.
The idea of having those separate is that a service which has dropped
its capabilities can still create a fully privileged user namespace.
For example, systemd’s machined drops capabilities from its bounding set,
yet it should be able to create unprivileged containers.
The invariant is sound because a child userns can never regain what it
doesn’t have in its bounding set. If it helps you can view the userns
set as a “namespace bounding set” since it defines the future bounding
sets of namespaced tasks.

> If I am reading this right, for unprivileged processes the capabilities in
> the userns are bounded by the process's permitted set before the userns is
> created?

Yes, unprivileged processes that want to raise a capability in their
userns set need it in their permitted set (as well as their bounding
set). This is similar to inheritable capabilities.
Recall that processes start with a full set of userns capabilities, so
if you drop a userns capability (or something else did, e.g.
init/pam/sysctl/parent) you will never be able to regain it, and
namespaces you create won't have it included.
Now, if you’re root (or cap privileged) you can always regain it.

> This is only being respected in PR_CTL, the user mode helper is straight
> setting the caps.

The usermode helper requires CAP_SYS_MODULE and CAP_SETPCAP in the initns so
the permitted set is irrelevant there. It starts with a full set but from
there you can only lower caps, so the invariant holds.
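The behaviour described in this reply (a lowered userns capability is not in any namespace created afterwards) can be sketched in userspace C. The following is a minimal model of the set_cred_user_ns() change from patch 1, not kernel code: cred_model and the model_* helpers are hypothetical names, 64-bit masks stand in for kernel_cap_t, and CAP_LAST_CAP = 40 is an assumption inferred from the 41-bit masks in the sample output of the patch description.

```c
#include <stdint.h>

#define CAP_SYS_RAWIO		17
/* Assumption: highest capability number, inferred from the sample output. */
#define MODEL_CAP_LAST_CAP	40
#define MODEL_CAP_FULL_SET	((UINT64_C(1) << (MODEL_CAP_LAST_CAP + 1)) - 1)

/* Hypothetical stand-in for the capability fields of struct cred. */
struct cred_model {
	uint64_t inheritable;
	uint64_t permitted;
	uint64_t effective;
	uint64_t bset;
	uint64_t ambient;
	uint64_t userns;
};

/* Models prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, cap, 0, 0). */
static void model_lower_userns(struct cred_model *c, unsigned int cap)
{
	c->userns &= ~(UINT64_C(1) << cap);
}

/*
 * Models the patched set_cred_user_ns(): the new namespace starts with
 * the userns set as its bounding, permitted and effective sets, and
 * with empty inheritable and ambient sets.
 */
static void model_enter_userns(struct cred_model *c)
{
	c->bset = c->userns;
	c->permitted = c->userns;
	c->effective = c->userns;
	c->inheritable = 0;
	c->ambient = 0;
}
```

Lowering CAP_SYS_RAWIO and then entering a namespace in this model gives permitted, effective and bounding sets of 0x1fffffdffff, which is the value shown for CapPrm, CapEff, CapBnd and CapUNs in the sample program's comment.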
* Re: [PATCH 1/3] capabilities: user namespace capabilities
From: John Johansen @ 2024-05-17 11:59 UTC
To: Jonathan Calmels
Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
	Serge Hallyn, Paul Moore, James Morris, David Howells,
	Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel,
	linux-security-module, keyrings

On 5/17/24 03:51, Jonathan Calmels wrote:
> On Thu, May 16, 2024 at 03:07:28PM GMT, John Johansen wrote:
>> agreed, though it really is application dependent. Some applications handle
>> the denial at userns creation better, than the capability after. Others
>> like anything based on QTWebEngine will crash on denial of userns creation
>> but handle denial of the capability within the userns just fine, and some
>> applications just crash regardless.
>
> Yes this is application specific, but I would argue that the latter is
> much preferable. For example, having one application crash in a
> container is probably ok, but not being able to start the container in
> the first place is probably not. Similarly, preventing the network
> namespace creation breaks services which rely on systemd’s
> PrivateNetwork, even though they most likely use it to prevent any
> networking from being done.
>

Agreed, the solution has to be application/usage model specific. Some of
them are easy, and others not so much.

>> The userns cred from the LSM hook can be modified, yes it is currently
>> specified as const but is still under construction so it can be safely
>> modified the LSM hook just needs a small update.
>>
>> The advantage of doing it under the LSM is an LSM can have a richer policy
>> around what can use them and tracking of what is allowed. That is to say the
>> LSM has the capability of being finer grained than doing it via capabilities.
>
> Sure, we could modify the LSM hook to do all sorts of things, but
> leveraging it would be quite cumbersome, will take time to show up in
> userspace, or simply never be adopted.
> We’re already seeing it in Ubuntu which started requiring AppArmor profiles.
>

yes, I would argue that is a metric of adoption.

> This new capability set would be a universal thing that could be
> leveraged today without modification to userspace. Moreover, it’s a
> simple framework that can be extended.

I would argue that is a problem. Userspace has to change for this to be
secure. Is it an improvement over the current state? Yes.

> As you mentioned, LSMs are even finer grained, and that’s the idea,
> those could be used hand in hand eventually. You could envision LSM
> hooks controlling the userns capability set, and thus enforce policies
> on the creation of nested namespaces without limiting the other tasks’
> capabilities.
>
>> I am not opposed to adding another mechanism to control user namespaces,
>> I am just not currently convinced that capabilities are the right
>> mechanism.
>
> Well that’s the thing, from past conversations, there is a lot of
> disagreement about restricting namespaces. By restricting the
> capabilities granted by namespaces instead, we’re actually treating the
> root cause of most concerns.
>

no disagreement there. This is actually Ubuntu's posture with user namespaces
atm, where the user namespace is allowed but the capabilities within it
are denied.

It does however, when not handled correctly, result in some very odd failures
and would be easier to debug if the use of user namespaces were just
cleanly denied.

> Today user namespaces are "special" and always grant full caps. Adding a
> new capability set to limit this behavior is logical; same way it's done
> for usual process transitions.
> Essentially this set is to namespaces what the inheritable set is to
> root.
>

it's not so much the capabilities set as the inheritable part that is
problematic. Yes I am well aware of where that is required but I question
that capabilities provide the needed controls here.

>> this should be bounded by the creating task's bounding set, otherwise
>> the capability model's bounding invariant will be broken, but having the
>> capabilities that the userns wants to access in the task's bounding set is
>> a problem for all the unprivileged processes wanting access to user
>> namespaces.
>
> This is possible with the security bit introduced in the second patch.
> The idea of having those separate is that a service which has dropped
> its capabilities can still create a fully privileged user namespace.

yes, which is the problem. Not that we don't do that with say setuid
applications, but the difference is that they were known to be doing
something dangerous and took measures around that.

We are starting from a different posture here, where applications have
assumed that user namespaces were safe and no measures were needed.
Tools like unshare and bwrap if set to allow user namespaces in their
fcaps will allow exploits a trivial by-pass.

> For example, systemd’s machined drops capabilities from its bounding set,
> yet it should be able to create unprivileged containers.
> The invariant is sound because a child userns can never regain what it
> doesn’t have in its bounding set. If it helps you can view the userns
> set as a “namespace bounding set” since it defines the future bounding
> sets of namespaced tasks.
>

sure I get it, some of the use cases work, some not so well

>> If I am reading this right, for unprivileged processes the capabilities in
>> the userns are bounded by the process's permitted set before the userns is
>> created?
>
> Yes, unprivileged processes that want to raise a capability in their
> userns set need it in their permitted set (as well as their bounding
> set). This is similar to inheritable capabilities.

Right.

> Recall that processes start with a full set of userns capabilities, so
> if you drop a userns capability (or something else did, e.g.
> init/pam/sysctl/parent) you will never be able to regain it, and
> namespaces you create won't have it included.

sure, that part of the behavior is fine

> Now, if you’re root (or cap privileged) you can always regain it.
>

yes

What I was trying to get at is two points.
1. The written description wasn't clear enough, leaving room for
   ambiguity.
2. That I question whether the behavior should be allowed given the
   current set of tools that use user namespaces. It reduces exploit
   code's ability to directly use unprivileged user namespaces but
   makes it all too easy to by-pass the restriction because of the
   behavior of the current tool set. ie. user space has to change.

>> This is only being respected in PR_CTL, the user mode helper is straight
>> setting the caps.
>
> The usermode helper requires CAP_SYS_MODULE and CAP_SETPCAP in the initns so
> the permitted set is irrelevant there. It starts with a full set but from
> there you can only lower caps, so the invariant holds.
>

sure, I get what is happening. Again the description needs work. It was
ambiguous as to whether it was applying to the fcaps or only the pcaps.

But again, I believe the fcaps behavior is wrong, because of the state of
current software. If this had been a proposal where there was no existing
software infrastructure I would be starting from a different stance.
* Re: [PATCH 1/3] capabilities: user namespace capabilities
From: Jonathan Calmels @ 2024-05-18 3:50 UTC
To: John Johansen
Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
	Serge Hallyn, Paul Moore, James Morris, David Howells,
	Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel,
	linux-security-module, keyrings

On Fri, May 17, 2024 at 04:59:41AM GMT, John Johansen wrote:
> On 5/17/24 03:51, Jonathan Calmels wrote:
> > This new capability set would be a universal thing that could be
> > leveraged today without modification to userspace. Moreover, it’s a
> > simple framework that can be extended.
>
> I would argue that is a problem. Userspace has to change for this to be
> secure. Is it an improvement over the current state? Yes.

Well, yes and no. With those patches, I can lock down things today on my
system and I don't need to change anything.

For example I can decide that none of my rootless containers started
under SSH will get CAP_NET_ADMIN:

  # echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
  # echo "!cap_net_admin $USER" >> /etc/security/capability.conf
  # capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd

  $ ssh localhost 'unshare -r capsh --current'
  Current: =ep cap_net_admin-ep
  Current IAB: !cap_net_admin

Or I can decide that I don't ever want CAP_SYS_RAWIO in my namespaces:

  # sysctl -w cap_bound_userns_mask=0x1fffffdffff

This doesn't require changes to userspace.
Now, granted if you want to have finer-grained controls, it will require
*some* changes in *some* places (e.g. adding new systemd property like
UserNSSet=).

> > Well that’s the thing, from past conversations, there is a lot of
> > disagreement about restricting namespaces. By restricting the
> > capabilities granted by namespaces instead, we’re actually treating the
> > root cause of most concerns.
> >
> no disagreement there. This is actually Ubuntu's posture with user namespaces
> atm, where the user namespace is allowed but the capabilities within it
> are denied.
>
> It does however, when not handled correctly, result in some very odd failures
> and would be easier to debug if the use of user namespaces were just
> cleanly denied.

Yes but as we established it depends on the use case, the two are not
mutually exclusive.

> it's not so much the capabilities set as the inheritable part that is
> problematic. Yes I am well aware of where that is required but I question
> that capabilities provide the needed controls here.

Again, I'm not opposed to doing this with LSMs. I just think both could
work well together. We already do that with standard capabilities vs
LSMs, both have their strengths and weaknesses.

It's always a tradeoff, do you want a setting that's universal and
coarse, or do you want one that's tailored to specific things but less
ubiquitous.

It's also a tradeoff on usability. If this doesn't get used in practice,
then there is no point.
I would argue that even though capabilities are complicated, they are
more widely understood than LSMs. Are capabilities insufficient in
certain scenarios, absolutely, and that's usually where LSMs come in.

> > This is possible with the security bit introduced in the second patch.
> > The idea of having those separate is that a service which has dropped
> > its capabilities can still create a fully privileged user namespace.
>
> yes, which is the problem. Not that we don't do that with say setuid
> applications, but the difference is that they were known to be doing
> something dangerous and took measures around that.
>
> We are starting from a different posture here, where applications have
> assumed that user namespaces were safe and no measures were needed.
> Tools like unshare and bwrap if set to allow user namespaces in their
> fcaps will allow exploits a trivial by-pass.

Agreed, but we can't really walk back this decision unfortunately.
At least with this patch series system administrators have the ability
to limit such tools.

> What I was trying to get at is two points.
> 1. The written description wasn't clear enough, leaving room for
>    ambiguity.
> 2. That I question whether the behavior should be allowed given the
>    current set of tools that use user namespaces. It reduces exploit
>    code's ability to directly use unprivileged user namespaces but
>    makes it all too easy to by-pass the restriction because of the
>    behavior of the current tool set. ie. user space has to change.

> But again, I believe the fcaps behavior is wrong, because of the state of
> current software. If this had been a proposal where there was no existing
> software infrastructure I would be starting from a different stance.

As mentioned above, userspace doesn't necessarily have to change. I'm
also not sure what you mean by easy to by-pass? If I mask off some
capabilities system wide or in a given process tree, I know for a fact
that no namespace will ever get those capabilities.
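The sysctl value used in this message, 0x1fffffdffff, is the full capability mask with the CAP_SYS_RAWIO bit (17) cleared. As a small worked example, the helper below derives such a value from a list of capabilities to drop. It is a sketch under assumptions: the function names are invented for illustration, and CAP_LAST_CAP = 40 is inferred from the 41-bit masks that appear in the thread rather than read from a running kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define CAP_SYS_RAWIO		17
/* Assumption: highest capability number, giving a 41-bit full mask. */
#define MODEL_CAP_LAST_CAP	40

/* Full capability mask with every capability in caps[] cleared. */
static uint64_t userns_mask_without(const unsigned int *caps, size_t n)
{
	uint64_t mask = (UINT64_C(1) << (MODEL_CAP_LAST_CAP + 1)) - 1;

	for (size_t i = 0; i < n; i++)
		mask &= ~(UINT64_C(1) << caps[i]);
	return mask;
}

/* Render the mask in the hexadecimal form used in the sysctl example. */
static int format_userns_mask(char *buf, size_t len, uint64_t mask)
{
	return snprintf(buf, len, "0x%llx", (unsigned long long)mask);
}
```

Calling userns_mask_without() with only CAP_SYS_RAWIO in the list and formatting the result yields the string 0x1fffffdffff, matching the value passed to sysctl above.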
* Re: [PATCH 1/3] capabilities: user namespace capabilities
From: John Johansen @ 2024-05-18 12:27 UTC
To: Jonathan Calmels
Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
	Serge Hallyn, Paul Moore, James Morris, David Howells,
	Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel,
	linux-security-module, keyrings

On 5/17/24 20:50, Jonathan Calmels wrote:
> On Fri, May 17, 2024 at 04:59:41AM GMT, John Johansen wrote:
>> On 5/17/24 03:51, Jonathan Calmels wrote:
>>> This new capability set would be a universal thing that could be
>>> leveraged today without modification to userspace. Moreover, it’s a
>>> simple framework that can be extended.
>>
>> I would argue that is a problem. Userspace has to change for this to be
>> secure. Is it an improvement over the current state? Yes.
>
> Well, yes and no. With those patches, I can lock down things today on my
> system and I don't need to change anything.
>

sure, same as with the big no user ns toggle. This is finer and allows
selectively enabling on a per application basis.

> For example I can decide that none of my rootless containers started
> under SSH will get CAP_NET_ADMIN:
>
>   # echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
>   # echo "!cap_net_admin $USER" >> /etc/security/capability.conf
>   # capsh --secbits=$((1 << 8)) -- -c /usr/sbin/sshd
>
>   $ ssh localhost 'unshare -r capsh --current'
>   Current: =ep cap_net_admin-ep
>   Current IAB: !cap_net_admin
>
> Or I can decide that I don't ever want CAP_SYS_RAWIO in my namespaces:
>
>   # sysctl -w cap_bound_userns_mask=0x1fffffdffff
>
> This doesn't require changes to userspace.
> Now, granted if you want to have finer-grained controls, it will require
> *some* changes in *some* places (e.g. adding new systemd property like
> UserNSSet=).
>

yep

>>> Well that’s the thing, from past conversations, there is a lot of
>>> disagreement about restricting namespaces. By restricting the
>>> capabilities granted by namespaces instead, we’re actually treating the
>>> root cause of most concerns.
>>>
>> no disagreement there. This is actually Ubuntu's posture with user namespaces
>> atm, where the user namespace is allowed but the capabilities within it
>> are denied.
>>
>> It does however, when not handled correctly, result in some very odd failures
>> and would be easier to debug if the use of user namespaces were just
>> cleanly denied.
>
> Yes but as we established it depends on the use case, the two are not
> mutually exclusive.
>

yep

>> it's not so much the capabilities set as the inheritable part that is
>> problematic. Yes I am well aware of where that is required but I question
>> that capabilities provide the needed controls here.
>
> Again, I'm not opposed to doing this with LSMs. I just think both could
> work well together. We already do that with standard capabilities vs
> LSMs, both have their strengths and weaknesses.
>

yes, don't get me wrong I am not necessarily advocating an LSM solution
as being necessary. I just want to make sure the trade-offs of the
capabilities solution get discussed to help evaluate whether extending
the current capability model is worth it.

> It's always a tradeoff, do you want a setting that's universal and
> coarse, or do you want one that's tailored to specific things but less
> ubiquitous.
>

yep

> It's also a tradeoff on usability. If this doesn't get used in practice,
> then there is no point.

agreed

> I would argue that even though capabilities are complicated, they are
> more widely understood than LSMs. Are capabilities insufficient in
> certain scenarios, absolutely, and that's usually where LSMs come in.
>

hrmmm, I am not sure I would agree that capabilities are better
understood than LSMs. At the base level of capability(X) to get
permission yes, but the whole permitting, bounding, ...

Really I think most people are just confused by both.

>>> This is possible with the security bit introduced in the second patch.
>>> The idea of having those separate is that a service which has dropped
>>> its capabilities can still create a fully privileged user namespace.
>>
>> yes, which is the problem. Not that we don't do that with say setuid
>> applications, but the difference is that they were known to be doing
>> something dangerous and took measures around that.
>>
>> We are starting from a different posture here, where applications have
>> assumed that user namespaces were safe and no measures were needed.
>> Tools like unshare and bwrap if set to allow user namespaces in their
>> fcaps will allow exploits a trivial by-pass.
>
> Agreed, but we can't really walk back this decision unfortunately.

And that is partly the crux of the issue, if we can't walk back the
decision then the solution becomes more complex

> At least with this patch series system administrators have the ability
> to limit such tools.
>

agreed

>> What I was trying to get at is two points.
>> 1. The written description wasn't clear enough, leaving room for
>>    ambiguity.
>> 2. That I question whether the behavior should be allowed given the
>>    current set of tools that use user namespaces. It reduces exploit
>>    code's ability to directly use unprivileged user namespaces but
>>    makes it all too easy to by-pass the restriction because of the
>>    behavior of the current tool set. ie. user space has to change.
>
>> But again, I believe the fcaps behavior is wrong, because of the state of
>> current software. If this had been a proposal where there was no existing
>> software infrastructure I would be starting from a different stance.
>
> As mentioned above, userspace doesn't necessarily have to change. I'm
> also not sure what you mean by easy to by-pass? If I mask off some
> capabilities system wide or in a given process tree, I know for a fact
> that no namespace will ever get those capabilities.

so by-pass will very much depend on the system but from a distro pov
we pretty much have to have bwrap enabled if users want to use flatpaks
(and they do), same story for several other tools. Since this basically
means said tools need to be available by default, most systems the
distro is installed on are vulnerable by default. The trivial by-pass
then becomes the exploit running its payload through one of these tools,
and yes I have tested it.

Could a distro disable these tools by default, and require the user/admin
to enable them, yes though there would be a lot of friction, push back,
and in the end most systems would still end up with them enabled.

With the capabilities approach can a user/admin make their system
more secure than the current situation, absolutely.

Note, that regardless of what happens with patch 1, and 2. I think we
either need the big sysctl toggle, or a version of your patch 3
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-18 12:27 ` John Johansen
@ 2024-05-19  1:33 ` Jonathan Calmels
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Calmels @ 2024-05-19 1:33 UTC
To: John Johansen
Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
    Serge Hallyn, Paul Moore, James Morris, David Howells,
    Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel,
    linux-security-module, keyrings

On Sat, May 18, 2024 at 05:27:27AM GMT, John Johansen wrote:
> On 5/17/24 20:50, Jonathan Calmels wrote:
> > As mentioned above, userspace doesn't necessarily have to change. I'm
> > also not sure what you mean by easy to by-pass? If I mask off some
> > capabilities system wide or in a given process tree, I know for a fact
> > that no namespace will ever get those capabilities.
>
> so by-pass will very much depend on the system but from a distro pov
> we pretty much have to have bwrap enabled if users want to use flatpaks
> (and they do), same story for several other tools. Since this basically
> means said tools need to be available by default, most systems the
> distro is installed on are vulnerable by default. The trivial by-pass
> then becomes the exploit running its payload through one of these tools,
> and yes I have tested it.
>
> Could a distro disable these tools by default, and require the user/admin
> to enable them, yes though there would be a lot of friction, push back,
> and in the end most systems would still end up with them enabled.
>
> With the capabilities approach can a user/admin make their system
> more secure than the current situation, absolutely.
>
> Note, that regardless of what happens with patch 1, and 2. I think we
> either need the big sysctl toggle, or a version of your patch 3

Ah ok, I get your concerns. Unfortunately, I can't really speak for distros
or tooling about how this gets leveraged.
I've never claimed this was going to be bulletproof day 1. All I'm saying
is that they now have the option to do so.

As you pointed out, we're coming from a model where today it's open-bar.
Only now they can put a bouncer in front of it, so to speak :)

Regarding distros:

Maybe they ship with an empty userns mask by default and admins have to
tweak it, understanding full well the consequences of doing so.
Maybe they ship with a conservative mask and use pam rules to adjust it.
Maybe they introduce something like a wheel/sudo group that you need to be
part of to gain extra privileges in your userns.
Maybe only some system services (e.g. dockerd, lxd/incusd, machined) get
confined.
Maybe they need highly specific policies, and this is where you would want
LSM support. Say an AppArmor profile targeting unshare(1) specifically.

Regarding tools:

Maybe bwrap has its own group you need to be part of to get full caps.
Maybe docker uses this set behind `--cap-add` `--cap-drop`.
Maybe lxd/incusd implement ACLs restricting who can do what.
Maybe steam always drops everything it doesn't need.

I'm sure this won't cover every single corner case, but as stated in the
headline, this is a start, a simple framework we can always extend if
needed in the future.
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-16  9:22 ` [PATCH 1/3] capabilities: " Jonathan Calmels
  2024-05-16 12:27   ` Jarkko Sakkinen
  2024-05-16 22:07   ` John Johansen
@ 2024-05-17 11:32   ` Eric W. Biederman
  2024-05-17 11:55     ` Jonathan Calmels
  2024-05-20  3:30   ` Serge E. Hallyn
  2024-05-20  3:36   ` Serge E. Hallyn
  4 siblings, 1 reply; 53+ messages in thread
From: Eric W. Biederman @ 2024-05-17 11:32 UTC
To: Jonathan Calmels
Cc: brauner, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn,
    Paul Moore, James Morris, David Howells, Jarkko Sakkinen,
    containers, linux-kernel, linux-fsdevel, linux-security-module,
    keyrings

Jonathan Calmels <jcalmels@3xx0.net> writes:

> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable
> them.

Pointers please?

That sentence sounds about 5 years out of date.

Eric
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-17 11:32 ` Eric W. Biederman
@ 2024-05-17 11:55   ` Jonathan Calmels
  2024-05-17 12:48     ` John Johansen
  2024-05-17 14:22     ` Eric W. Biederman
  0 siblings, 2 replies; 53+ messages in thread
From: Jonathan Calmels @ 2024-05-17 11:55 UTC
To: Eric W. Biederman
Cc: brauner, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn,
    Paul Moore, James Morris, David Howells, Jarkko Sakkinen,
    containers, linux-kernel, linux-fsdevel, linux-security-module,
    keyrings

On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
>
> Pointers please?
>
> That sentence sounds about 5 years out of date.

The link referenced is from last year.
Here are some others often cited by distributions:

https://nvd.nist.gov/vuln/detail/CVE-2022-0185
https://nvd.nist.gov/vuln/detail/CVE-2022-1015
https://nvd.nist.gov/vuln/detail/CVE-2022-2078
https://nvd.nist.gov/vuln/detail/CVE-2022-24122
https://nvd.nist.gov/vuln/detail/CVE-2022-25636

Recent thread discussing this too:
https://seclists.org/oss-sec/2024/q2/128
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-17 11:55 ` Jonathan Calmels
@ 2024-05-17 12:48   ` John Johansen
  2024-05-17 14:22   ` Eric W. Biederman
  1 sibling, 0 replies; 53+ messages in thread
From: John Johansen @ 2024-05-17 12:48 UTC
To: Jonathan Calmels, Eric W. Biederman
Cc: brauner, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn,
    Paul Moore, James Morris, David Howells, Jarkko Sakkinen,
    containers, linux-kernel, linux-fsdevel, linux-security-module,
    keyrings

On 5/17/24 04:55, Jonathan Calmels wrote:
> On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
>>
>> Pointers please?
>>
>> That sentence sounds about 5 years out of date.
>
> The link referenced is from last year.
> Here are some others often cited by distributions:
>
> https://nvd.nist.gov/vuln/detail/CVE-2022-0185
> https://nvd.nist.gov/vuln/detail/CVE-2022-1015
> https://nvd.nist.gov/vuln/detail/CVE-2022-2078
> https://nvd.nist.gov/vuln/detail/CVE-2022-24122
> https://nvd.nist.gov/vuln/detail/CVE-2022-25636
>
> Recent thread discussing this too:
> https://seclists.org/oss-sec/2024/q2/128
>

they were used in 2020, 2021, and 2022 pwn2own exploits. Sorry I don't
remember the exact numbers and will have to dig.

pwn2own 2023: 4/5 hacks used them
https://www.zerodayinitiative.com/blog/2023/3/23/pwn2own-vancouver-2023-day-two-results
I will need to dig to find the CVEs associated with them.

pwn2own 2024: I can not discuss atm

but it's not just pwn2own, the actual list of kernel CVEs that
unprivileged user namespaces make exploitable is much larger.
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-17 11:55 ` Jonathan Calmels
  2024-05-17 12:48   ` John Johansen
@ 2024-05-17 14:22   ` Eric W. Biederman
  2024-05-17 18:02     ` Jonathan Calmels
  2024-05-21 15:52     ` John Johansen
  1 sibling, 2 replies; 53+ messages in thread
From: Eric W. Biederman @ 2024-05-17 14:22 UTC
To: Jonathan Calmels
Cc: brauner, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn,
    Paul Moore, James Morris, David Howells, Jarkko Sakkinen,
    containers, linux-kernel, linux-fsdevel, linux-security-module,
    keyrings

Jonathan Calmels <jcalmels@3xx0.net> writes:

> On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
>>
>> Pointers please?
>>
>> That sentence sounds about 5 years out of date.
>
> The link referenced is from last year.
> Here are some others often cited by distributions:
>
> https://nvd.nist.gov/vuln/detail/CVE-2022-0185
> https://nvd.nist.gov/vuln/detail/CVE-2022-1015
> https://nvd.nist.gov/vuln/detail/CVE-2022-2078
> https://nvd.nist.gov/vuln/detail/CVE-2022-24122
> https://nvd.nist.gov/vuln/detail/CVE-2022-25636
>
> Recent thread discussing this too:
> https://seclists.org/oss-sec/2024/q2/128

My apologies perhaps I trimmed too much.

I know that user namespaces enlarge the attack surface.
How much and how serious could be debated but for unprivileged
users the attack surface is undoubtedly enlarged.

As I read your introduction you were justifying the introduction
of a new security mechanism with the observation that distributions
were carrying distribution specific patches.

To the best of my knowledge distribution specific patches and
distributions disabling user namespaces have been gone for quite a
while. So if that has changed recently I would like to know.

Thank you,
Eric
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-17 14:22 ` Eric W. Biederman
@ 2024-05-17 18:02   ` Jonathan Calmels
  2024-05-21 15:52   ` John Johansen
  1 sibling, 0 replies; 53+ messages in thread
From: Jonathan Calmels @ 2024-05-17 18:02 UTC
To: Eric W. Biederman
Cc: brauner, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn,
    Paul Moore, James Morris, David Howells, Jarkko Sakkinen,
    containers, linux-kernel, linux-fsdevel, linux-security-module,
    keyrings

> > On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
> As I read your introduction you were justifying the introduction
> of a new security mechanism with the observation that distributions
> were carrying distribution specific patches.
>
> To the best of my knowledge distribution specific patches and
> distributions disabling user namespaces have been gone for quite a
> while. So if that has changed recently I would like to know.

Off the top of my head:

- RHEL based:
  namespace.unpriv_enable
  user_namespace.enable

- Arch/Debian based:
  kernel.unprivileged_userns_clone

- Ubuntu based:
  kernel.apparmor_restrict_unprivileged_userns

I'm not sure which exact versions those apply to, but they are definitely
still out there.

The observation is that while you can disable namespaces today, in practice
it breaks userspace in various ways. Hence, being able to control
capabilities is a better way to approach it.

For example, today's big hammer to prevent CAP_NET_ADMIN in userns:

# sysctl -qw user.max_net_namespaces=0

$ unshare -U -r -n ip tuntap add mode tap tap0 && echo OK
unshare: unshare failed: No space left on device

With the patch, this becomes manageable:

# capsh --drop=cap_net_admin --secbits=$((1 << 8)) --user=$USER -- \
        -c 'unshare -U -r -n ip tuntap add mode tap tap0 && echo OK'
ioctl(TUNSETIFF): Operation not permitted
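To make the contrast above a bit more concrete, here is a minimal userspace
model (not kernel code) of the end state the capsh example relies on. It
assumes CAP_NET_ADMIN has ended up cleared in the task's userns set; how the
securebit from patch 2 gets it there is not modeled. The model_* names and
the struct are hypothetical, and the values 12 for CAP_NET_ADMIN and 40 for
the last capability are assumed from <linux/capability.h>, not taken from
this thread. The transition mirrors the patched set_cred_user_ns() from
patch 1.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values from <linux/capability.h>; not stated in this thread. */
#define MODEL_CAP_NET_ADMIN 12
#define MODEL_CAP_LAST_CAP  40
#define MODEL_CAP_FULL_SET  ((UINT64_C(1) << (MODEL_CAP_LAST_CAP + 1)) - 1)

/* Hypothetical stand-in for the capability fields of struct cred. */
struct model_cred {
	uint64_t bset, permitted, effective, inheritable, ambient, userns;
};

/* Clear a single capability bit in a set. */
static uint64_t model_cap_lower(uint64_t set, int cap)
{
	assert(cap >= 0 && cap <= MODEL_CAP_LAST_CAP);
	return set & ~(UINT64_C(1) << cap);
}

/*
 * Model of the patched set_cred_user_ns(): the userns set seeds the
 * bounding, permitted and effective sets of the new namespace's creds,
 * while the inheritable and ambient sets start empty.
 */
static struct model_cred model_enter_userns(struct model_cred c)
{
	c.bset = c.userns;
	c.permitted = c.userns;
	c.effective = c.userns;
	c.inheritable = 0;
	c.ambient = 0;
	return c;
}

/* Non-zero if the capability is present in the effective set. */
static int model_capable(const struct model_cred *c, int cap)
{
	assert(cap >= 0 && cap <= MODEL_CAP_LAST_CAP);
	return (int)((c->effective >> cap) & 1);
}
```

In this model the namespace is still created, but model_capable() reports
CAP_NET_ADMIN as missing afterwards, which corresponds to the TUNSETIFF
ioctl failing with EPERM rather than unshare failing outright.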
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-17 14:22 ` Eric W. Biederman
  2024-05-17 18:02   ` Jonathan Calmels
@ 2024-05-21 15:52   ` John Johansen
  1 sibling, 0 replies; 53+ messages in thread
From: John Johansen @ 2024-05-21 15:52 UTC
To: Eric W. Biederman, Jonathan Calmels
Cc: brauner, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn,
    Paul Moore, James Morris, David Howells, Jarkko Sakkinen,
    containers, linux-kernel, linux-fsdevel, linux-security-module,
    keyrings

On 5/17/24 07:22, Eric W. Biederman wrote:
> Jonathan Calmels <jcalmels@3xx0.net> writes:
>
>> On Fri, May 17, 2024 at 06:32:46AM GMT, Eric W. Biederman wrote:
>>>
>>> Pointers please?
>>>
>>> That sentence sounds about 5 years out of date.
>>
>> The link referenced is from last year.
>> Here are some others often cited by distributions:
>>
>> https://nvd.nist.gov/vuln/detail/CVE-2022-0185
>> https://nvd.nist.gov/vuln/detail/CVE-2022-1015
>> https://nvd.nist.gov/vuln/detail/CVE-2022-2078
>> https://nvd.nist.gov/vuln/detail/CVE-2022-24122
>> https://nvd.nist.gov/vuln/detail/CVE-2022-25636
>>
>> Recent thread discussing this too:
>> https://seclists.org/oss-sec/2024/q2/128
>
> My apologies perhaps I trimmed too much.
>
> I know that user namespaces enlarge the attack surface.
> How much and how serious could be debated but for unprivileged
> users the attack surface is undoubtedly enlarged.
>
> As I read your introduction you were justifying the introduction
> of a new security mechanism with the observation that distributions
> were carrying distribution specific patches.
>
> To the best of my knowledge distribution specific patches and
> distributions disabling user namespaces have been gone for quite a
> while. So if that has changed recently I would like to know.
>

almost all the distros are carrying the out-of-tree sysctl to disable
user namespaces. It's disabled by default but is available. Ubuntu in
its 24.04 release is now limiting unprivileged use of user namespaces
to known code. At a generic code level they are allowed but with
no capabilities within the user namespace.
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-16  9:22 ` [PATCH 1/3] capabilities: " Jonathan Calmels
                   ` (2 preceding siblings ...)
  2024-05-17 11:32   ` Eric W. Biederman
@ 2024-05-20  3:30   ` Serge E. Hallyn
  2024-05-20  3:36   ` Serge E. Hallyn
  4 siblings, 0 replies; 53+ messages in thread
From: Serge E. Hallyn @ 2024-05-20 3:30 UTC
To: Jonathan Calmels
Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
    Serge Hallyn, Paul Moore, James Morris, David Howells,
    Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel,
    linux-security-module, keyrings

On Thu, May 16, 2024 at 02:22:03AM -0700, Jonathan Calmels wrote:
> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable them.
> As a result, there have been multiple efforts over the years to introduce
> various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
>
> While we acknowledge that there are already ways to control the creation of
> such namespaces (the most recent being a LSM hook), there are inherent
> issues with these approaches. Preventing the user namespace creation is not
> fine-grained enough, and in some cases, incompatible with various userspace
> expectations (e.g. container runtimes, browser sandboxing, service
> isolation)
>
> This patch addresses these limitations by introducing an additional
> capability set used to restrict the permissions granted when creating user
> namespaces. This way, processes can apply the principle of least privilege
> by configuring only the capabilities they need for their namespaces.
>
> For compatibility reasons, processes always start with a full userns
> capability set.
>
> On namespace creation, the userns capability set (pU) is assigned to the
> new effective (pE), permitted (pP) and bounding set (X) of the task:
>
>     pU = pE = pP = X
>
> The userns capability set obeys the invariant that no bit can ever be set
> if it is not already part of the task’s bounding set. This ensures that no
> namespace can ever gain more privileges than its predecessors.
> Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
> in the userns set requires its corresponding bit to be set in the permitted
> set. This effectively mimics the inheritable set rules and means that, by
> default, only root in the initial user namespace can gain userns
> capabilities:
>
>     p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
>
> Note that since userns capabilities are strictly hierarchical, policies can
> be enforced at various levels (e.g. init, pam_cap) and inherited by every
> child namespace.
>
> Here is a sample program that can be used to verify the functionality:
>
> /*
>  * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
>  *
>  * ./cap_userns_test unshare -r grep Cap /proc/self/status
>  * CapInh: 0000000000000000
>  * CapPrm: 000001fffffdffff
>  * CapEff: 000001fffffdffff
>  * CapBnd: 000001fffffdffff
>  * CapAmb: 0000000000000000
>  * CapUNs: 000001fffffdffff
>  */
>
> int main(int argc, char *argv[])
> {
> 	if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
> 		err(1, "cannot drop userns cap");
>
> 	execvp(argv[1], argv + 1);
> 	err(1, "cannot exec");
> }
>
> Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
> Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org
> Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com
> Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht
>
> Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net>

Thanks! Of course we'll see how the conversations fall out, but

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  fs/proc/array.c              |  9 ++++++
>  include/linux/cred.h         |  3 ++
>  include/uapi/linux/prctl.h   |  7 +++++
>  kernel/cred.c                |  3 ++
>  kernel/umh.c                 | 16 ++++++++++
>  kernel/user_namespace.c      | 12 +++-----
>  security/commoncap.c         | 59 ++++++++++++++++++++++++++++++++++++
>  security/keys/process_keys.c |  3 ++
>  8 files changed, 105 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 34a47fb0c57f..364e8bb19f9d 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	const struct cred *cred;
>  	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
>  			cap_bset, cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t cap_userns;
> +#endif
>
>  	rcu_read_lock();
>  	cred = __task_cred(p);
> @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	cap_effective = cred->cap_effective;
>  	cap_bset = cred->cap_bset;
>  	cap_ambient = cred->cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	cap_userns = cred->cap_userns;
> +#endif
>  	rcu_read_unlock();
>
>  	render_cap_t(m, "CapInh:\t", &cap_inheritable);
> @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	render_cap_t(m, "CapEff:\t", &cap_effective);
>  	render_cap_t(m, "CapBnd:\t", &cap_bset);
>  	render_cap_t(m, "CapAmb:\t", &cap_ambient);
> +#ifdef CONFIG_USER_NS
> +	render_cap_t(m, "CapUNs:\t", &cap_userns);
> +#endif
>  }
>
>  static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 2976f534a7a3..adab0031443e 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -124,6 +124,9 @@ struct cred {
>  	kernel_cap_t cap_effective; /* caps we can actually use */
>  	kernel_cap_t cap_bset; /* capability bounding set */
>  	kernel_cap_t cap_ambient; /* Ambient capability set */
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t cap_userns; /* User namespace capability set */
> +#endif
>  #ifdef CONFIG_KEYS
>  	unsigned char jit_keyring; /* default keyring to attach requested
>  				    * keys to */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 370ed14b1ae0..e09475171f62 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -198,6 +198,13 @@ struct prctl_mm_map {
>  # define PR_CAP_AMBIENT_LOWER 3
>  # define PR_CAP_AMBIENT_CLEAR_ALL 4
>
> +/* Control the userns capability set */
> +#define PR_CAP_USERNS 48
> +# define PR_CAP_USERNS_IS_SET 1
> +# define PR_CAP_USERNS_RAISE 2
> +# define PR_CAP_USERNS_LOWER 3
> +# define PR_CAP_USERNS_CLEAR_ALL 4
> +
>  /* arm64 Scalable Vector Extension controls */
>  /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
>  #define PR_SVE_SET_VL 50 /* set task vector length */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 075cfa7c896f..9912c6f3bc6b 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -56,6 +56,9 @@ struct cred init_cred = {
>  	.cap_permitted = CAP_FULL_SET,
>  	.cap_effective = CAP_FULL_SET,
>  	.cap_bset = CAP_FULL_SET,
> +#ifdef CONFIG_USER_NS
> +	.cap_userns = CAP_FULL_SET,
> +#endif
>  	.user = INIT_USER,
>  	.user_ns = &init_user_ns,
>  	.group_info = &init_groups,
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 1b13c5d34624..51f1e1d25d49 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -32,6 +32,9 @@
>
>  #include <trace/events/module.h>
>
> +#ifdef CONFIG_USER_NS
> +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
> +#endif
>  static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
>  static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
>  static DEFINE_SPINLOCK(umh_sysctl_lock);
> @@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data)
>  	new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
>  	new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
>  					     new->cap_inheritable);
> +#ifdef CONFIG_USER_NS
> +	new->cap_userns = cap_intersect(usermodehelper_userns,
> +					new->cap_userns);
> +#endif
>  	spin_unlock(&umh_sysctl_lock);
>
>  	if (sub_info->init) {
> @@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = {
>  		.mode = 0600,
>  		.proc_handler = proc_cap_handler,
>  	},
> +#ifdef CONFIG_USER_NS
> +	{
> +		.procname = "userns",
> +		.data = &usermodehelper_userns,
> +		.maxlen = 2 * sizeof(unsigned long),
> +		.mode = 0600,
> +		.proc_handler = proc_cap_handler,
> +	},
> +#endif
>  	{ }
>  };
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 0b0b95418b16..7e624607330b 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)
>
>  static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  {
> -	/* Start with the same capabilities as init but useless for doing
> -	 * anything as the capabilities are bound to the new user namespace.
> -	 */
> -	cred->securebits = SECUREBITS_DEFAULT;
> +	/* Start with the capabilities defined in the userns set. */
> +	cred->cap_bset = cred->cap_userns;
> +	cred->cap_permitted = cred->cap_userns;
> +	cred->cap_effective = cred->cap_userns;
>  	cred->cap_inheritable = CAP_EMPTY_SET;
> -	cred->cap_permitted = CAP_FULL_SET;
> -	cred->cap_effective = CAP_FULL_SET;
>  	cred->cap_ambient = CAP_EMPTY_SET;
> -	cred->cap_bset = CAP_FULL_SET;
> +	cred->securebits = SECUREBITS_DEFAULT;
>  #ifdef CONFIG_KEYS
>  	key_put(cred->request_key_auth);
>  	cred->request_key_auth = NULL;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 162d96b3a676..b3d3372bf910 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void)
>  	return 1;
>  }
>
> +/*
> + * Determine whether a userns capability can be raised.
> + * Returns 1 if it can, 0 otherwise.
> + */
> +#ifdef CONFIG_USER_NS
> +static inline int cap_uns_is_raiseable(unsigned long cap)
> +{
> +	if (!!cap_raised(current_cred()->cap_userns, cap))
> +		return 1;
> +	/* a capability cannot be raised unless the current task has it in
> +	 * its bounding set and, without CAP_SETPCAP, its permitted set.
> +	 */
> +	if (!cap_raised(current_cred()->cap_bset, cap))
> +		return 0;
> +	if (cap_capable(current_cred(), current_cred()->user_ns,
> +			CAP_SETPCAP, CAP_OPT_NONE) != 0 &&
> +	    !cap_raised(current_cred()->cap_permitted, cap))
> +		return 0;
> +	return 1;
> +}
> +#endif
> +
>  /**
>   * cap_capset - Validate and apply proposed changes to current's capabilities
>   * @new: The proposed new credentials; alterations should be made here
> @@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3,
>  			return commit_creds(new);
>  		}
>
> +#ifdef CONFIG_USER_NS
> +	case PR_CAP_USERNS:
> +		if (arg2 == PR_CAP_USERNS_CLEAR_ALL) {
> +			if (arg3 | arg4 | arg5)
> +				return -EINVAL;
> +
> +			new = prepare_creds();
> +			if (!new)
> +				return -ENOMEM;
> +			cap_clear(new->cap_userns);
> +			return commit_creds(new);
> +		}
> +
> +		if (((!cap_valid(arg3)) | arg4 | arg5))
> +			return -EINVAL;
> +
> +		if (arg2 == PR_CAP_USERNS_IS_SET) {
> +			return !!cap_raised(current_cred()->cap_userns, arg3);
> +		} else if (arg2 != PR_CAP_USERNS_RAISE &&
> +			   arg2 != PR_CAP_USERNS_LOWER) {
> +			return -EINVAL;
> +		} else {
> +			if (arg2 == PR_CAP_USERNS_RAISE &&
> +			    !cap_uns_is_raiseable(arg3))
> +				return -EPERM;
> +
> +			new = prepare_creds();
> +			if (!new)
> +				return -ENOMEM;
> +			if (arg2 == PR_CAP_USERNS_RAISE)
> +				cap_raise(new->cap_userns, arg3);
> +			else
> +				cap_lower(new->cap_userns, arg3);
> +			return commit_creds(new);
> +		}
> +#endif
> +
>  	default:
>  		/* No functionality available - continue with default */
>  		return -ENOSYS;
> diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
> index b5d5333ab330..e3670d815435 100644
> --- a/security/keys/process_keys.c
> +++ b/security/keys/process_keys.c
> @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork)
>  	new->cap_effective = old->cap_effective;
>  	new->cap_ambient = old->cap_ambient;
>  	new->cap_bset = old->cap_bset;
> +#ifdef CONFIG_USER_NS
> +	new->cap_userns = old->cap_userns;
> +#endif
>
>  	new->jit_keyring = old->jit_keyring;
>  	new->thread_keyring = key_get(old->thread_keyring);
> --
> 2.45.0
>
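The raise rule from the commit message, p’U = (pE & CAP_SETPCAP) ? X : (X & pP),
and the checks in cap_uns_is_raiseable() can be sketched as a small userspace
model (not kernel code). The model_* names are hypothetical, has_setpcap
stands in for the cap_capable() check on CAP_SETPCAP, and the values 17 for
CAP_SYS_RAWIO and 40 for the last capability are assumed from
<linux/capability.h>; they are consistent with the 000001fffffdffff mask
printed by the sample program in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values from <linux/capability.h>; not stated in this thread. */
#define MODEL_CAP_SYS_RAWIO 17
#define MODEL_CAP_LAST_CAP  40
#define MODEL_CAP_FULL_SET  ((UINT64_C(1) << (MODEL_CAP_LAST_CAP + 1)) - 1)

/* Non-zero if the capability bit is set in the given set. */
static int model_cap_raised(uint64_t set, int cap)
{
	assert(cap >= 0 && cap <= MODEL_CAP_LAST_CAP);
	return (int)((set >> cap) & 1);
}

/* Effect of PR_CAP_USERNS_LOWER on the userns set: clear one bit. */
static uint64_t model_userns_lower(uint64_t userns, int cap)
{
	assert(cap >= 0 && cap <= MODEL_CAP_LAST_CAP);
	return userns & ~(UINT64_C(1) << cap);
}

/*
 * Model of cap_uns_is_raiseable(): a bit that is already set stays
 * raiseable; otherwise it must be in the bounding set and, when the
 * task lacks CAP_SETPCAP, in the permitted set as well.
 */
static int model_uns_is_raiseable(uint64_t userns, uint64_t bset,
				  uint64_t permitted, int has_setpcap,
				  int cap)
{
	if (model_cap_raised(userns, cap))
		return 1;
	if (!model_cap_raised(bset, cap))
		return 0;
	if (!has_setpcap && !model_cap_raised(permitted, cap))
		return 0;
	return 1;
}
```

In this model, lowering CAP_SYS_RAWIO from a full set yields 0x1fffffdffff.
A task without CAP_SETPCAP and with an empty permitted set cannot raise the
bit again, and inside the new namespace nobody can, because the bounding
set there no longer contains it.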
* Re: [PATCH 1/3] capabilities: user namespace capabilities
  2024-05-16  9:22 ` [PATCH 1/3] capabilities: " Jonathan Calmels
                   ` (3 preceding siblings ...)
  2024-05-20  3:30   ` Serge E. Hallyn
@ 2024-05-20  3:36   ` Serge E. Hallyn
  4 siblings, 0 replies; 53+ messages in thread
From: Serge E. Hallyn @ 2024-05-20 3:36 UTC
To: Jonathan Calmels
Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados,
    Serge Hallyn, Paul Moore, James Morris, David Howells,
    Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel,
    linux-security-module, keyrings

On Thu, May 16, 2024 at 02:22:03AM -0700, Jonathan Calmels wrote:
> Attackers often rely on user namespaces to get elevated (yet confined)
> privileges in order to target specific subsystems (e.g. [1]). Distributions
> have been pretty adamant that they need a way to configure these, most of
> them carry out-of-tree patches to do so, or plainly refuse to enable them.
> As a result, there have been multiple efforts over the years to introduce
> various knobs to control and/or disable user namespaces (e.g. [2][3][4]).
>
> While we acknowledge that there are already ways to control the creation of
> such namespaces (the most recent being a LSM hook), there are inherent
> issues with these approaches. Preventing the user namespace creation is not
> fine-grained enough, and in some cases, incompatible with various userspace
> expectations (e.g. container runtimes, browser sandboxing, service
> isolation)
>
> This patch addresses these limitations by introducing an additional
> capability set used to restrict the permissions granted when creating user
> namespaces. This way, processes can apply the principle of least privilege
> by configuring only the capabilities they need for their namespaces.
>
> For compatibility reasons, processes always start with a full userns
> capability set.
>
> On namespace creation, the userns capability set (pU) is assigned to the
> new effective (pE), permitted (pP) and bounding set (X) of the task:
>
>     pU = pE = pP = X
>
> The userns capability set obeys the invariant that no bit can ever be set
> if it is not already part of the task’s bounding set. This ensures that no
> namespace can ever gain more privileges than its predecessors.
> Additionally, if a task is not privileged over CAP_SETPCAP, setting any bit
> in the userns set requires its corresponding bit to be set in the permitted
> set. This effectively mimics the inheritable set rules and means that, by
> default, only root in the initial user namespace can gain userns
> capabilities:
>
>     p’U = (pE & CAP_SETPCAP) ? X : (X & pP)
>
> Note that since userns capabilities are strictly hierarchical, policies can
> be enforced at various levels (e.g. init, pam_cap) and inherited by every
> child namespace.
>
> Here is a sample program that can be used to verify the functionality:
>
> /*
>  * Test program that drops CAP_SYS_RAWIO from subsequent user namespaces.
>  *
>  * ./cap_userns_test unshare -r grep Cap /proc/self/status
>  * CapInh: 0000000000000000
>  * CapPrm: 000001fffffdffff
>  * CapEff: 000001fffffdffff
>  * CapBnd: 000001fffffdffff
>  * CapAmb: 0000000000000000
>  * CapUNs: 000001fffffdffff
>  */
>
> int main(int argc, char *argv[])
> {
> 	if (prctl(PR_CAP_USERNS, PR_CAP_USERNS_LOWER, CAP_SYS_RAWIO, 0, 0) < 0)
> 		err(1, "cannot drop userns cap");
>
> 	execvp(argv[1], argv + 1);
> 	err(1, "cannot exec");
> }
>
> Link: https://security.googleblog.com/2023/06/learnings-from-kctf-vrps-42-linux.html
> Link: https://lore.kernel.org/lkml/1453502345-30416-1-git-send-email-keescook@chromium.org
> Link: https://lore.kernel.org/lkml/20220815162028.926858-1-fred@cloudflare.com
> Link: https://lore.kernel.org/containers/168547265011.24337.4306067683997517082-0@git.sr.ht
>
> Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net>
> ---
>  fs/proc/array.c              |  9 ++++++
>  include/linux/cred.h         |  3 ++
>  include/uapi/linux/prctl.h   |  7 +++++
>  kernel/cred.c                |  3 ++
>  kernel/umh.c                 | 16 ++++++++++
>  kernel/user_namespace.c      | 12 +++-----
>  security/commoncap.c         | 59 ++++++++++++++++++++++++++++++++++++
>  security/keys/process_keys.c |  3 ++
>  8 files changed, 105 insertions(+), 7 deletions(-)
>
> diff --git a/fs/proc/array.c b/fs/proc/array.c
> index 34a47fb0c57f..364e8bb19f9d 100644
> --- a/fs/proc/array.c
> +++ b/fs/proc/array.c
> @@ -313,6 +313,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	const struct cred *cred;
>  	kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
>  			cap_bset, cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t cap_userns;
> +#endif
>
>  	rcu_read_lock();
>  	cred = __task_cred(p);
> @@ -321,6 +324,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	cap_effective = cred->cap_effective;
>  	cap_bset = cred->cap_bset;
>  	cap_ambient = cred->cap_ambient;
> +#ifdef CONFIG_USER_NS
> +	cap_userns = cred->cap_userns;
> +#endif
>  	rcu_read_unlock();
>
>  	render_cap_t(m, "CapInh:\t", &cap_inheritable);
> @@ -328,6 +334,9 @@ static inline void task_cap(struct seq_file *m, struct task_struct *p)
>  	render_cap_t(m, "CapEff:\t", &cap_effective);
>  	render_cap_t(m, "CapBnd:\t", &cap_bset);
>  	render_cap_t(m, "CapAmb:\t", &cap_ambient);
> +#ifdef CONFIG_USER_NS
> +	render_cap_t(m, "CapUNs:\t", &cap_userns);
> +#endif
>  }
>
>  static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index 2976f534a7a3..adab0031443e 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -124,6 +124,9 @@ struct cred {
>  	kernel_cap_t cap_effective; /* caps we can actually use */
>  	kernel_cap_t cap_bset; /* capability bounding set */
>  	kernel_cap_t cap_ambient; /* Ambient capability set */
> +#ifdef CONFIG_USER_NS
> +	kernel_cap_t cap_userns; /* User namespace capability set */
> +#endif
>  #ifdef CONFIG_KEYS
>  	unsigned char jit_keyring; /* default keyring to attach requested
>  				    * keys to */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 370ed14b1ae0..e09475171f62 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -198,6 +198,13 @@ struct prctl_mm_map {
>  # define PR_CAP_AMBIENT_LOWER 3
>  # define PR_CAP_AMBIENT_CLEAR_ALL 4
>
> +/* Control the userns capability set */
> +#define PR_CAP_USERNS 48
> +# define PR_CAP_USERNS_IS_SET 1
> +# define PR_CAP_USERNS_RAISE 2
> +# define PR_CAP_USERNS_LOWER 3
> +# define PR_CAP_USERNS_CLEAR_ALL 4
> +
>  /* arm64 Scalable Vector Extension controls */
>  /* Flag values must be kept in sync with ptrace NT_ARM_SVE interface */
>  #define PR_SVE_SET_VL 50 /* set task vector length */
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 075cfa7c896f..9912c6f3bc6b 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -56,6 +56,9 @@ struct cred init_cred = {
>  	.cap_permitted = CAP_FULL_SET,
>  	.cap_effective = CAP_FULL_SET,
>  	.cap_bset = CAP_FULL_SET,
> +#ifdef CONFIG_USER_NS
> +	.cap_userns = CAP_FULL_SET,
> +#endif
>  	.user = INIT_USER,
>  	.user_ns = &init_user_ns,
>  	.group_info = &init_groups,
> diff --git a/kernel/umh.c b/kernel/umh.c
> index 1b13c5d34624..51f1e1d25d49 100644
> --- a/kernel/umh.c
> +++ b/kernel/umh.c
> @@ -32,6 +32,9 @@
>
>  #include <trace/events/module.h>
>
> +#ifdef CONFIG_USER_NS
> +static kernel_cap_t usermodehelper_userns = CAP_FULL_SET;
> +#endif
>  static kernel_cap_t usermodehelper_bset = CAP_FULL_SET;
>  static kernel_cap_t usermodehelper_inheritable = CAP_FULL_SET;
>  static DEFINE_SPINLOCK(umh_sysctl_lock);
> @@ -94,6 +97,10 @@ static int call_usermodehelper_exec_async(void *data)
>  	new->cap_bset = cap_intersect(usermodehelper_bset, new->cap_bset);
>  	new->cap_inheritable = cap_intersect(usermodehelper_inheritable,
>  					     new->cap_inheritable);
> +#ifdef CONFIG_USER_NS
> +	new->cap_userns = cap_intersect(usermodehelper_userns,
> +					new->cap_userns);
> +#endif
>  	spin_unlock(&umh_sysctl_lock);
>
>  	if (sub_info->init) {
> @@ -560,6 +567,15 @@ static struct ctl_table usermodehelper_table[] = {
>  		.mode = 0600,
>  		.proc_handler = proc_cap_handler,
>  	},
> +#ifdef CONFIG_USER_NS
> +	{
> +		.procname = "userns",
> +		.data = &usermodehelper_userns,
> +		.maxlen = 2 * sizeof(unsigned long),
> +		.mode = 0600,
> +		.proc_handler = proc_cap_handler,
> +	},
> +#endif
>  	{ }
>  };
>
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 0b0b95418b16..7e624607330b 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -42,15 +42,13 @@ static void dec_user_namespaces(struct ucounts *ucounts)
>
>  static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  {
> -	/* Start with the same capabilities as init but useless for doing
> -	 * anything as the capabilities are bound to the new user namespace.
> -	 */
> -	cred->securebits = SECUREBITS_DEFAULT;
> +	/* Start with the capabilities defined in the userns set. */
> +	cred->cap_bset = cred->cap_userns;
> +	cred->cap_permitted = cred->cap_userns;
> +	cred->cap_effective = cred->cap_userns;
>  	cred->cap_inheritable = CAP_EMPTY_SET;
> -	cred->cap_permitted = CAP_FULL_SET;
> -	cred->cap_effective = CAP_FULL_SET;
>  	cred->cap_ambient = CAP_EMPTY_SET;
> -	cred->cap_bset = CAP_FULL_SET;
> +	cred->securebits = SECUREBITS_DEFAULT;
>  #ifdef CONFIG_KEYS
>  	key_put(cred->request_key_auth);
>  	cred->request_key_auth = NULL;
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 162d96b3a676..b3d3372bf910 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -228,6 +228,28 @@ static inline int cap_inh_is_capped(void)
>  	return 1;
>  }
>
> +/*
> + * Determine whether a userns capability can be raised.
> + * Returns 1 if it can, 0 otherwise.
> + */
> +#ifdef CONFIG_USER_NS
> +static inline int cap_uns_is_raiseable(unsigned long cap)
> +{
> +	if (!!cap_raised(current_cred()->cap_userns, cap))
> +		return 1;
> +	/* a capability cannot be raised unless the current task has it in
> +	 * its bounding set and, without CAP_SETPCAP, its permitted set.
> + */ > + if (!cap_raised(current_cred()->cap_bset, cap)) > + return 0; > + if (cap_capable(current_cred(), current_cred()->user_ns, > + CAP_SETPCAP, CAP_OPT_NONE) != 0 && > + !cap_raised(current_cred()->cap_permitted, cap)) > + return 0; > + return 1; > +} > +#endif > + > /** > * cap_capset - Validate and apply proposed changes to current's capabilities > * @new: The proposed new credentials; alterations should be made here > @@ -1382,6 +1404,43 @@ int cap_task_prctl(int option, unsigned long arg2, unsigned long arg3, > return commit_creds(new); > } > > +#ifdef CONFIG_USER_NS > + case PR_CAP_USERNS: > + if (arg2 == PR_CAP_USERNS_CLEAR_ALL) { > + if (arg3 | arg4 | arg5) > + return -EINVAL; > + > + new = prepare_creds(); > + if (!new) > + return -ENOMEM; > + cap_clear(new->cap_userns); > + return commit_creds(new); > + } > + > + if (((!cap_valid(arg3)) | arg4 | arg5)) > + return -EINVAL; > + > + if (arg2 == PR_CAP_USERNS_IS_SET) { > + return !!cap_raised(current_cred()->cap_userns, arg3); > + } else if (arg2 != PR_CAP_USERNS_RAISE && > + arg2 != PR_CAP_USERNS_LOWER) { > + return -EINVAL; > + } else {

Sorry, I meant to say, one nit would be that this next block does not need to be in an else, since every other condition returns.

> + if (arg2 == PR_CAP_USERNS_RAISE && > + !cap_uns_is_raiseable(arg3)) > + return -EPERM; > + > + new = prepare_creds(); > + if (!new) > + return -ENOMEM; > + if (arg2 == PR_CAP_USERNS_RAISE) > + cap_raise(new->cap_userns, arg3); > + else > + cap_lower(new->cap_userns, arg3); > + return commit_creds(new); > + } > +#endif > + > default: > /* No functionality available - continue with default */ > return -ENOSYS; > diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c > index b5d5333ab330..e3670d815435 100644 > --- a/security/keys/process_keys.c > +++ b/security/keys/process_keys.c > @@ -944,6 +944,9 @@ void key_change_session_keyring(struct callback_head *twork) > new->cap_effective = old->cap_effective; > new->cap_ambient = old->cap_ambient; > new->cap_bset = old->cap_bset; > +#ifdef CONFIG_USER_NS > + new->cap_userns = old->cap_userns; > +#endif > > new->jit_keyring = old->jit_keyring; > new->thread_keyring = key_get(old->thread_keyring); > -- > 2.45.0 > ^ permalink raw reply [flat|nested] 53+ messages in thread
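For readers following the control-flow nit, here is a minimal userspace model of the PR_CAP_USERNS dispatch with the trailing else removed. It is a sketch only: the `model_*` names, the `USERNS_*` enum, `struct model_cred`, the `has_setpcap` flag (standing in for the cap_capable() check) and the assumption that the highest valid capability number is 40 are all illustrative and not part of the patch; capability sets are modelled as plain 64-bit masks rather than kernel_cap_t.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the proposed prctl sub-commands (values from the patch). */
enum { USERNS_IS_SET = 1, USERNS_RAISE = 2, USERNS_LOWER = 3, USERNS_CLEAR_ALL = 4 };

/* Assumption: highest valid capability number on the kernels discussed here. */
#define MODEL_CAP_LAST 40UL

struct model_cred {
	uint64_t bset;       /* bounding set */
	uint64_t permitted;  /* permitted set */
	uint64_t userns;     /* proposed userns set */
	int has_setpcap;     /* stands in for the CAP_SETPCAP cap_capable() check */
};

/* Mirrors cap_uns_is_raiseable(): 1 if the bit may be raised, 0 otherwise. */
int model_raiseable(const struct model_cred *c, unsigned long cap)
{
	uint64_t bit = UINT64_C(1) << cap;

	if (c->userns & bit)
		return 1;
	if (!(c->bset & bit))
		return 0;
	if (!c->has_setpcap && !(c->permitted & bit))
		return 0;
	return 1;
}

/* Same decisions as the quoted hunk, but every branch returns, so no else is needed. */
int model_prctl_userns(struct model_cred *c, unsigned long op,
		       unsigned long cap, unsigned long arg4,
		       unsigned long arg5)
{
	assert(c != NULL);

	if (op == USERNS_CLEAR_ALL) {
		if (cap | arg4 | arg5)
			return -EINVAL;
		c->userns = 0;
		return 0;
	}

	if (cap > MODEL_CAP_LAST || arg4 || arg5)
		return -EINVAL;

	if (op == USERNS_IS_SET)
		return !!(c->userns & (UINT64_C(1) << cap));

	if (op != USERNS_RAISE && op != USERNS_LOWER)
		return -EINVAL;

	if (op == USERNS_RAISE && !model_raiseable(c, cap))
		return -EPERM;

	if (op == USERNS_RAISE)
		c->userns |= UINT64_C(1) << cap;
	else
		c->userns &= ~(UINT64_C(1) << cap);
	return 0;
}
```

Lowering bit 17 (CAP_SYS_RAWIO) from a full 41-bit set in this model gives 0x1fffffdffff, the CapUNs value shown in the commit message example.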
* [PATCH 2/3] capabilities: add securebit for strict userns caps 2024-05-16 9:22 [PATCH 0/3] Introduce user namespace capabilities Jonathan Calmels 2024-05-16 9:22 ` [PATCH 1/3] capabilities: " Jonathan Calmels @ 2024-05-16 9:22 ` Jonathan Calmels 2024-05-16 12:42 ` Jarkko Sakkinen 2024-05-20 3:38 ` Serge E. Hallyn 2024-05-16 9:22 ` [PATCH 3/3] capabilities: add cap userns sysctl mask Jonathan Calmels ` (3 subsequent siblings) 5 siblings, 2 replies; 53+ messages in thread From: Jonathan Calmels @ 2024-05-16 9:22 UTC (permalink / raw To: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen Cc: containers, Jonathan Calmels, linux-kernel, linux-fsdevel, linux-security-module, keyrings This patch adds a new capability security bit designed to constrain a task’s userns capability set to its bounding set. The reason for this is twofold: - This serves as a quick and easy way to lock down a set of capabilities for a task, thus ensuring that any namespace it creates will never be more privileged than itself is. - This helps userspace transition to more secure defaults by not requiring specific logic for the userns capability set, or libcap support. 
Example: # capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ -c 'unshare -r grep Cap /proc/self/status' CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net> --- include/linux/securebits.h | 1 + include/uapi/linux/securebits.h | 11 ++++++++++- kernel/user_namespace.c | 5 +++++ 3 files changed, 16 insertions(+), 1 deletion(-) diff --git a/include/linux/securebits.h b/include/linux/securebits.h index 656528673983..5f9d85cd69c3 100644 --- a/include/linux/securebits.h +++ b/include/linux/securebits.h @@ -5,4 +5,5 @@ #include <uapi/linux/securebits.h> #define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits)) +#define iscredsecure(cred, X) (issecure_mask(X) & cred->securebits) #endif /* !_LINUX_SECUREBITS_H */ diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h index d6d98877ff1a..2da3f4be4531 100644 --- a/include/uapi/linux/securebits.h +++ b/include/uapi/linux/securebits.h @@ -52,10 +52,19 @@ #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED)) +/* When set, user namespace capabilities are restricted to their parent's bounding set. 
*/ +#define SECURE_USERNS_STRICT_CAPS 8 +#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make bit-8 immutable */ + +#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS)) +#define SECBIT_USERNS_STRICT_CAPS_LOCKED \ + (issecure_mask(SECURE_USERNS_STRICT_CAPS_LOCKED)) + #define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ issecure_mask(SECURE_NO_SETUID_FIXUP) | \ issecure_mask(SECURE_KEEP_CAPS) | \ - issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE)) + issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \ + issecure_mask(SECURE_USERNS_STRICT_CAPS)) #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1) #endif /* _UAPI_LINUX_SECUREBITS_H */ diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 7e624607330b..53848e2b68cd 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -10,6 +10,7 @@ #include <linux/cred.h> #include <linux/securebits.h> #include <linux/security.h> +#include <linux/capability.h> #include <linux/keyctl.h> #include <linux/key-type.h> #include <keys/user-type.h> @@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts) static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) { + /* Limit userns capabilities to our parent's bounding set. */ + if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS)) + cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset); + /* Start with the capabilities defined in the userns set. */ cred->cap_bset = cred->cap_userns; cred->cap_permitted = cred->cap_userns; -- 2.45.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
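The effect of the strict securebit on set_cred_user_ns() can be sketched in userspace as follows. This is an illustrative model, not kernel code: `struct strict_cred`, the `MODEL_*` macros and the 41-capability full set are assumptions made for the example, and sets are plain 64-bit masks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SECURE_USERNS_STRICT_CAPS 8               /* bit index proposed by the patch */
#define MODEL_CAP_SYS_RAWIO 17                          /* CAP_SYS_RAWIO in linux/capability.h */
#define MODEL_CAP_FULL_SET UINT64_C(0x1ffffffffff)      /* assumption: 41 capabilities */

struct strict_cred {
	uint64_t bset, permitted, effective, userns;
	unsigned int securebits;
};

/* Models the patched set_cred_user_ns(): optionally clamp, then seed the new sets. */
void model_set_cred_user_ns(struct strict_cred *c)
{
	assert(c != NULL);

	/* Strict mode: the userns set can never exceed the parent's bounding set. */
	if (c->securebits & (1u << MODEL_SECURE_USERNS_STRICT_CAPS))
		c->userns &= c->bset;

	/* The new namespace starts with exactly the userns set. */
	c->bset = c->userns;
	c->permitted = c->userns;
	c->effective = c->userns;
	c->securebits = 0; /* SECUREBITS_DEFAULT */
}
```

With the bit set and CAP_SYS_RAWIO dropped from the bounding set, the model yields 0x1fffffdffff for every set, matching the capsh example above (`--secbits=$((1 << 8))` is 256). Without the bit, the userns set is left untouched and the child starts with the full set.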
* Re: [PATCH 2/3] capabilities: add securebit for strict userns caps 2024-05-16 9:22 ` [PATCH 2/3] capabilities: add securebit for strict userns caps Jonathan Calmels @ 2024-05-16 12:42 ` Jarkko Sakkinen 2024-05-20 3:38 ` Serge E. Hallyn 1 sibling, 0 replies; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-16 12:42 UTC (permalink / raw To: Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells Cc: containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings Maintainer dependent but at least on x86 patches people tend to prefer capital letter in the short summary i.e. s/add/Add/ On Thu May 16, 2024 at 12:22 PM EEST, Jonathan Calmels wrote: > This patch adds a new capability security bit designed to constrain a > task’s userns capability set to its bounding set. The reason for this is > twofold: > > - This serves as a quick and easy way to lock down a set of capabilities > for a task, thus ensuring that any namespace it creates will never be > more privileged than itself is. > - This helps userspace transition to more secure defaults by not requiring > specific logic for the userns capability set, or libcap support. 
> > Example: > > # capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ > -c 'unshare -r grep Cap /proc/self/status' > CapInh: 0000000000000000 > CapPrm: 000001fffffdffff > CapEff: 000001fffffdffff > CapBnd: 000001fffffdffff > CapAmb: 0000000000000000 > CapUNs: 000001fffffdffff > > Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net> > --- > include/linux/securebits.h | 1 + > include/uapi/linux/securebits.h | 11 ++++++++++- > kernel/user_namespace.c | 5 +++++ > 3 files changed, 16 insertions(+), 1 deletion(-) > > diff --git a/include/linux/securebits.h b/include/linux/securebits.h > index 656528673983..5f9d85cd69c3 100644 > --- a/include/linux/securebits.h > +++ b/include/linux/securebits.h > @@ -5,4 +5,5 @@ > #include <uapi/linux/securebits.h> > > #define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits)) > +#define iscredsecure(cred, X) (issecure_mask(X) & cred->securebits) > #endif /* !_LINUX_SECUREBITS_H */ > diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h > index d6d98877ff1a..2da3f4be4531 100644 > --- a/include/uapi/linux/securebits.h > +++ b/include/uapi/linux/securebits.h > @@ -52,10 +52,19 @@ > #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ > (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED)) > > +/* When set, user namespace capabilities are restricted to their parent's bounding set. 
*/ > +#define SECURE_USERNS_STRICT_CAPS 8 > +#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make bit-8 immutable */ > + > +#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS)) > +#define SECBIT_USERNS_STRICT_CAPS_LOCKED \ > + (issecure_mask(SECURE_USERNS_STRICT_CAPS_LOCKED)) > + > #define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ > issecure_mask(SECURE_NO_SETUID_FIXUP) | \ > issecure_mask(SECURE_KEEP_CAPS) | \ > - issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE)) > + issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \ > + issecure_mask(SECURE_USERNS_STRICT_CAPS)) > #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1) > > #endif /* _UAPI_LINUX_SECUREBITS_H */ > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 7e624607330b..53848e2b68cd 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -10,6 +10,7 @@ > #include <linux/cred.h> > #include <linux/securebits.h> > #include <linux/security.h> > +#include <linux/capability.h> > #include <linux/keyctl.h> > #include <linux/key-type.h> > #include <keys/user-type.h> > @@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts) > > static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) > { > + /* Limit userns capabilities to our parent's bounding set. */ > + if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS)) > + cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset); > + > /* Start with the capabilities defined in the userns set. */ > cred->cap_bset = cred->cap_userns; > cred->cap_permitted = cred->cap_userns; BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 2/3] capabilities: add securebit for strict userns caps 2024-05-16 9:22 ` [PATCH 2/3] capabilities: add securebit for strict userns caps Jonathan Calmels 2024-05-16 12:42 ` Jarkko Sakkinen @ 2024-05-20 3:38 ` Serge E. Hallyn 1 sibling, 0 replies; 53+ messages in thread From: Serge E. Hallyn @ 2024-05-20 3:38 UTC (permalink / raw To: Jonathan Calmels Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu, May 16, 2024 at 02:22:04AM -0700, Jonathan Calmels wrote: > This patch adds a new capability security bit designed to constrain a > task’s userns capability set to its bounding set. The reason for this is > twofold: > > - This serves as a quick and easy way to lock down a set of capabilities > for a task, thus ensuring that any namespace it creates will never be > more privileged than itself is. > - This helps userspace transition to more secure defaults by not requiring > specific logic for the userns capability set, or libcap support. 
> > Example: > > # capsh --secbits=$((1 << 8)) --drop=cap_sys_rawio -- \ > -c 'unshare -r grep Cap /proc/self/status' > CapInh: 0000000000000000 > CapPrm: 000001fffffdffff > CapEff: 000001fffffdffff > CapBnd: 000001fffffdffff > CapAmb: 0000000000000000 > CapUNs: 000001fffffdffff > > Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net> Reviewed-by: Serge Hallyn <serge@hallyn.com> > --- > include/linux/securebits.h | 1 + > include/uapi/linux/securebits.h | 11 ++++++++++- > kernel/user_namespace.c | 5 +++++ > 3 files changed, 16 insertions(+), 1 deletion(-) > > diff --git a/include/linux/securebits.h b/include/linux/securebits.h > index 656528673983..5f9d85cd69c3 100644 > --- a/include/linux/securebits.h > +++ b/include/linux/securebits.h > @@ -5,4 +5,5 @@ > #include <uapi/linux/securebits.h> > > #define issecure(X) (issecure_mask(X) & current_cred_xxx(securebits)) > +#define iscredsecure(cred, X) (issecure_mask(X) & cred->securebits) > #endif /* !_LINUX_SECUREBITS_H */ > diff --git a/include/uapi/linux/securebits.h b/include/uapi/linux/securebits.h > index d6d98877ff1a..2da3f4be4531 100644 > --- a/include/uapi/linux/securebits.h > +++ b/include/uapi/linux/securebits.h > @@ -52,10 +52,19 @@ > #define SECBIT_NO_CAP_AMBIENT_RAISE_LOCKED \ > (issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE_LOCKED)) > > +/* When set, user namespace capabilities are restricted to their parent's bounding set. 
*/ > +#define SECURE_USERNS_STRICT_CAPS 8 > +#define SECURE_USERNS_STRICT_CAPS_LOCKED 9 /* make bit-8 immutable */ > + > +#define SECBIT_USERNS_STRICT_CAPS (issecure_mask(SECURE_USERNS_STRICT_CAPS)) > +#define SECBIT_USERNS_STRICT_CAPS_LOCKED \ > + (issecure_mask(SECURE_USERNS_STRICT_CAPS_LOCKED)) > + > #define SECURE_ALL_BITS (issecure_mask(SECURE_NOROOT) | \ > issecure_mask(SECURE_NO_SETUID_FIXUP) | \ > issecure_mask(SECURE_KEEP_CAPS) | \ > - issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE)) > + issecure_mask(SECURE_NO_CAP_AMBIENT_RAISE) | \ > + issecure_mask(SECURE_USERNS_STRICT_CAPS)) > #define SECURE_ALL_LOCKS (SECURE_ALL_BITS << 1) > > #endif /* _UAPI_LINUX_SECUREBITS_H */ > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 7e624607330b..53848e2b68cd 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -10,6 +10,7 @@ > #include <linux/cred.h> > #include <linux/securebits.h> > #include <linux/security.h> > +#include <linux/capability.h> > #include <linux/keyctl.h> > #include <linux/key-type.h> > #include <keys/user-type.h> > @@ -42,6 +43,10 @@ static void dec_user_namespaces(struct ucounts *ucounts) > > static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) > { > + /* Limit userns capabilities to our parent's bounding set. */ > + if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS)) > + cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset); > + > /* Start with the capabilities defined in the userns set. */ > cred->cap_bset = cred->cap_userns; > cred->cap_permitted = cred->cap_userns; > -- > 2.45.0 > ^ permalink raw reply [flat|nested] 53+ messages in thread
* [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-16 9:22 [PATCH 0/3] Introduce user namespace capabilities Jonathan Calmels 2024-05-16 9:22 ` [PATCH 1/3] capabilities: " Jonathan Calmels 2024-05-16 9:22 ` [PATCH 2/3] capabilities: add securebit for strict userns caps Jonathan Calmels @ 2024-05-16 9:22 ` Jonathan Calmels 2024-05-16 12:44 ` Jarkko Sakkinen ` (2 more replies) 2024-05-16 13:30 ` [PATCH 0/3] Introduce user namespace capabilities Ben Boeckel ` (2 subsequent siblings) 5 siblings, 3 replies; 53+ messages in thread From: Jonathan Calmels @ 2024-05-16 9:22 UTC (permalink / raw To: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen Cc: containers, Jonathan Calmels, linux-kernel, linux-fsdevel, linux-security-module, keyrings This patch adds a new system-wide userns capability mask designed to mask off capabilities in user namespaces. This mask is controlled through a sysctl and can be set early in the boot process or on the kernel command line to exclude known capabilities from ever being gained in namespaces. Once set, it can be further restricted to exert dynamic policies on the system (e.g. ward off a potential exploit). Changing this mask requires privileges over CAP_SYS_ADMIN and CAP_SETPCAP in the initial user namespace. 
Example: # sysctl -qw kernel.cap_userns_mask=0x1fffffdffff && \ unshare -r grep Cap /proc/self/status CapInh: 0000000000000000 CapPrm: 000001fffffdffff CapEff: 000001fffffdffff CapBnd: 000001fffffdffff CapAmb: 0000000000000000 CapUNs: 000001fffffdffff Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net> --- include/linux/user_namespace.h | 7 ++++ kernel/sysctl.c | 10 ++++++ kernel/user_namespace.c | 66 ++++++++++++++++++++++++++++++++++ 3 files changed, 83 insertions(+) diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h index 6030a8235617..e3478bd54ee5 100644 --- a/include/linux/user_namespace.h +++ b/include/linux/user_namespace.h @@ -2,6 +2,7 @@ #ifndef _LINUX_USER_NAMESPACE_H #define _LINUX_USER_NAMESPACE_H +#include <linux/capability.h> #include <linux/kref.h> #include <linux/nsproxy.h> #include <linux/ns_common.h> @@ -14,6 +15,12 @@ #define UID_GID_MAP_MAX_BASE_EXTENTS 5 #define UID_GID_MAP_MAX_EXTENTS 340 +#ifdef CONFIG_SYSCTL +extern kernel_cap_t cap_userns_mask; +int proc_cap_userns_handler(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos); +#endif + struct uid_gid_extent { u32 first; u32 lower_first; diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 81cc974913bb..1546eebd6aea 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -62,6 +62,7 @@ #include <linux/sched/sysctl.h> #include <linux/mount.h> #include <linux/userfaultfd_k.h> +#include <linux/user_namespace.h> #include <linux/pid.h> #include "../lib/kstrtox.h" @@ -1846,6 +1847,15 @@ static struct ctl_table kern_table[] = { .mode = 0444, .proc_handler = proc_dointvec, }, +#ifdef CONFIG_USER_NS + { + .procname = "cap_userns_mask", + .data = &cap_userns_mask, + .maxlen = sizeof(kernel_cap_t), + .mode = 0644, + .proc_handler = proc_cap_userns_handler, + }, +#endif #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86) { .procname = "unknown_nmi_panic", diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c index 
53848e2b68cd..e0cf606e9140 100644 --- a/kernel/user_namespace.c +++ b/kernel/user_namespace.c @@ -26,6 +26,66 @@ static struct kmem_cache *user_ns_cachep __ro_after_init; static DEFINE_MUTEX(userns_state_mutex); +#ifdef CONFIG_SYSCTL +static DEFINE_SPINLOCK(cap_userns_lock); +kernel_cap_t cap_userns_mask = CAP_FULL_SET; + +int proc_cap_userns_handler(struct ctl_table *table, int write, + void *buffer, size_t *lenp, loff_t *ppos) +{ + struct ctl_table t; + unsigned long mask_array[2]; + kernel_cap_t new_mask, *mask; + int err; + + if (write && (!capable(CAP_SETPCAP) || + !capable(CAP_SYS_ADMIN))) + return -EPERM; + + /* + * convert from the global kernel_cap_t to the ulong array to print to + * userspace if this is a read. + * + * capabilities are exposed as one 64-bit value or two 32-bit values + * depending on the architecture + */ + mask = table->data; + spin_lock(&cap_userns_lock); + mask_array[0] = (unsigned long) mask->val; +#if BITS_PER_LONG != 64 + mask_array[1] = mask->val >> BITS_PER_LONG; +#endif + spin_unlock(&cap_userns_lock); + + t = *table; + t.data = &mask_array; + + /* + * actually read or write and array of ulongs from userspace. Remember + * these are least significant bits first + */ + err = proc_doulongvec_minmax(&t, write, buffer, lenp, ppos); + if (err < 0) + return err; + + new_mask.val = mask_array[0]; +#if BITS_PER_LONG != 64 + new_mask.val += (u64)mask_array[1] << BITS_PER_LONG; +#endif + + /* + * Drop everything not in the new_mask (but don't add things) + */ + if (write) { + spin_lock(&cap_userns_lock); + *mask = cap_intersect(*mask, new_mask); + spin_unlock(&cap_userns_lock); + } + + return 0; +} +#endif + static bool new_idmap_permitted(const struct file *file, struct user_namespace *ns, int cap_setid, struct uid_gid_map *map); @@ -46,6 +106,12 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) /* Limit userns capabilities to our parent's bounding set. 
*/ if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS)) cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset); +#ifdef CONFIG_SYSCTL + /* Mask off userns capabilities that are not permitted by the system-wide mask. */ + spin_lock(&cap_userns_lock); + cred->cap_userns = cap_intersect(cred->cap_userns, cap_userns_mask); + spin_unlock(&cap_userns_lock); +#endif /* Start with the capabilities defined in the userns set. */ cred->cap_bset = cred->cap_userns; -- 2.45.0 ^ permalink raw reply related [flat|nested] 53+ messages in thread
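The two properties of the system-wide mask described in this patch (writes can only remove bits, and the mask is intersected with cap_userns at namespace creation) can be modelled in a few lines. The function names and the two boolean privilege flags below are hypothetical stand-ins for the capable() checks; locking is omitted because the model is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Models a write to kernel.cap_userns_mask: requires both privileges and
 * only ever intersects, so a bit that was dropped cannot be set again.
 */
int model_sysctl_write(uint64_t *mask, uint64_t requested,
		       int has_setpcap, int has_sys_admin)
{
	assert(mask != NULL);

	if (!has_setpcap || !has_sys_admin)
		return -EPERM;

	*mask &= requested;
	return 0;
}

/* Models the extra step in set_cred_user_ns(): clamp cap_userns by the mask. */
uint64_t model_new_userns_caps(uint64_t cap_userns, uint64_t mask)
{
	return cap_userns & mask;
}
```

Writing 0x1fffffdffff and then attempting to write the full set back leaves the mask at 0x1fffffdffff in this model, which is the "can be further restricted" behaviour the commit message describes.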
* Re: [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-16 9:22 ` [PATCH 3/3] capabilities: add cap userns sysctl mask Jonathan Calmels @ 2024-05-16 12:44 ` Jarkko Sakkinen 2024-05-20 3:38 ` Serge E. Hallyn 2024-05-20 13:30 ` Tycho Andersen 2 siblings, 0 replies; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-16 12:44 UTC (permalink / raw To: Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells Cc: containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu May 16, 2024 at 12:22 PM EEST, Jonathan Calmels wrote: > This patch adds a new system-wide userns capability mask designed to mask > off capabilities in user namespaces. > > This mask is controlled through a sysctl and can be set early in the boot > process or on the kernel command line to exclude known capabilities from > ever being gained in namespaces. Once set, it can be further restricted to > exert dynamic policies on the system (e.g. ward off a potential exploit). > > Changing this mask requires privileges over CAP_SYS_ADMIN and CAP_SETPCAP > in the initial user namespace. 
> > Example: > > # sysctl -qw kernel.cap_userns_mask=0x1fffffdffff && \ > unshare -r grep Cap /proc/self/status > CapInh: 0000000000000000 > CapPrm: 000001fffffdffff > CapEff: 000001fffffdffff > CapBnd: 000001fffffdffff > CapAmb: 0000000000000000 > CapUNs: 000001fffffdffff > > Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net> > --- > include/linux/user_namespace.h | 7 ++++ > kernel/sysctl.c | 10 ++++++ > kernel/user_namespace.c | 66 ++++++++++++++++++++++++++++++++++ > 3 files changed, 83 insertions(+) > > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h > index 6030a8235617..e3478bd54ee5 100644 > --- a/include/linux/user_namespace.h > +++ b/include/linux/user_namespace.h > @@ -2,6 +2,7 @@ > #ifndef _LINUX_USER_NAMESPACE_H > #define _LINUX_USER_NAMESPACE_H > > +#include <linux/capability.h> > #include <linux/kref.h> > #include <linux/nsproxy.h> > #include <linux/ns_common.h> > @@ -14,6 +15,12 @@ > #define UID_GID_MAP_MAX_BASE_EXTENTS 5 > #define UID_GID_MAP_MAX_EXTENTS 340 > > +#ifdef CONFIG_SYSCTL > +extern kernel_cap_t cap_userns_mask; > +int proc_cap_userns_handler(struct ctl_table *table, int write, > + void *buffer, size_t *lenp, loff_t *ppos); > +#endif > + > struct uid_gid_extent { > u32 first; > u32 lower_first; > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 81cc974913bb..1546eebd6aea 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -62,6 +62,7 @@ > #include <linux/sched/sysctl.h> > #include <linux/mount.h> > #include <linux/userfaultfd_k.h> > +#include <linux/user_namespace.h> > #include <linux/pid.h> > > #include "../lib/kstrtox.h" > @@ -1846,6 +1847,15 @@ static struct ctl_table kern_table[] = { > .mode = 0444, > .proc_handler = proc_dointvec, > }, > +#ifdef CONFIG_USER_NS > + { > + .procname = "cap_userns_mask", > + .data = &cap_userns_mask, > + .maxlen = sizeof(kernel_cap_t), > + .mode = 0644, > + .proc_handler = proc_cap_userns_handler, > + }, > +#endif > #if defined(CONFIG_X86_LOCAL_APIC) && 
defined(CONFIG_X86) > { > .procname = "unknown_nmi_panic", > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c > index 53848e2b68cd..e0cf606e9140 100644 > --- a/kernel/user_namespace.c > +++ b/kernel/user_namespace.c > @@ -26,6 +26,66 @@ > static struct kmem_cache *user_ns_cachep __ro_after_init; > static DEFINE_MUTEX(userns_state_mutex); > > +#ifdef CONFIG_SYSCTL > +static DEFINE_SPINLOCK(cap_userns_lock);

New global or file-local locks should generally have a comment that describes their use.

> +kernel_cap_t cap_userns_mask = CAP_FULL_SET; > +

Non-static symbols should have appropriate kdoc with all arguments and return values documented.

> +int proc_cap_userns_handler(struct ctl_table *table, int write, > + void *buffer, size_t *lenp, loff_t *ppos) > +{ > + struct ctl_table t; > + unsigned long mask_array[2]; > + kernel_cap_t new_mask, *mask; > + int err; > + > + if (write && (!capable(CAP_SETPCAP) || > + !capable(CAP_SYS_ADMIN))) > + return -EPERM; > + > + /* > + * convert from the global kernel_cap_t to the ulong array to print to > + * userspace if this is a read. > + * > + * capabilities are exposed as one 64-bit value or two 32-bit values > + * depending on the architecture > + */ > + mask = table->data; > + spin_lock(&cap_userns_lock); > + mask_array[0] = (unsigned long) mask->val; > +#if BITS_PER_LONG != 64 > + mask_array[1] = mask->val >> BITS_PER_LONG; > +#endif

Why not just "if (BITS_PER_LONG != 64)"? Compiler will do its job here.

> + spin_unlock(&cap_userns_lock); > + > + t = *table; > + t.data = &mask_array; > + > + /* > + * actually read or write and array of ulongs from userspace. Remember > + * these are least significant bits first > + */ > + err = proc_doulongvec_minmax(&t, write, buffer, lenp, ppos); > + if (err < 0) > + return err; > + > + new_mask.val = mask_array[0]; > +#if BITS_PER_LONG != 64 > + new_mask.val += (u64)mask_array[1] << BITS_PER_LONG; > +#endif

Ditto.

> + > + /* > + * Drop everything not in the new_mask (but don't add things) > + */ > + if (write) { > + spin_lock(&cap_userns_lock); > + *mask = cap_intersect(*mask, new_mask); > + spin_unlock(&cap_userns_lock); > + } > + > + return 0; > +} > +#endif > + > static bool new_idmap_permitted(const struct file *file, > struct user_namespace *ns, int cap_setid, > struct uid_gid_map *map); > @@ -46,6 +106,12 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns) > /* Limit userns capabilities to our parent's bounding set. */ > if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS)) > cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset); > +#ifdef CONFIG_SYSCTL > + /* Mask off userns capabilities that are not permitted by the system-wide mask. */ > + spin_lock(&cap_userns_lock); > + cred->cap_userns = cap_intersect(cred->cap_userns, cap_userns_mask); > + spin_unlock(&cap_userns_lock); > +#endif > > /* Start with the capabilities defined in the userns set. */ > cred->cap_bset = cred->cap_userns; BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
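As a rough illustration of the "plain if instead of #if" suggestion in this review, the word split could look like the userspace sketch below. Everything here is an assumption for the sake of the example: `MODEL_BITS_PER_LONG` is derived from sizeof() rather than taken from the kernel's BITS_PER_LONG, and the helpers shift by a fixed 32 instead of by the word size, which sidesteps a shift-by-64 expression on 64-bit builds (whether the kernel code would want that is a separate question).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's BITS_PER_LONG (compile-time constant). */
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Split a 64-bit capability mask into the ulong pair handed to the sysctl helper. */
void model_mask_to_ulongs(uint64_t mask, unsigned long out[2])
{
	assert(out != NULL);

	out[0] = (unsigned long)mask;
	out[1] = 0;
	/* Constant condition: the compiler can drop this branch on 64-bit builds. */
	if (MODEL_BITS_PER_LONG != 64)
		out[1] = (unsigned long)(mask >> 32);
}

/* Rebuild the 64-bit mask, least significant word first. */
uint64_t model_ulongs_to_mask(const unsigned long in[2])
{
	uint64_t mask;

	assert(in != NULL);

	mask = in[0];
	if (MODEL_BITS_PER_LONG != 64)
		mask |= (uint64_t)in[1] << 32;
	return mask;
}
```

The round trip returns the original mask on both 32-bit and 64-bit targets; on 64-bit targets the second word is simply left at zero.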
* Re: [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-16 9:22 ` [PATCH 3/3] capabilities: add cap userns sysctl mask Jonathan Calmels 2024-05-16 12:44 ` Jarkko Sakkinen @ 2024-05-20 3:38 ` Serge E. Hallyn 2024-05-20 13:30 ` Tycho Andersen 2 siblings, 0 replies; 53+ messages in thread From: Serge E. Hallyn @ 2024-05-20 3:38 UTC (permalink / raw To: Jonathan Calmels Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings

On Thu, May 16, 2024 at 02:22:05AM -0700, Jonathan Calmels wrote:
> This patch adds a new system-wide userns capability mask designed to mask
> off capabilities in user namespaces.
>
> This mask is controlled through a sysctl and can be set early in the boot
> process or on the kernel command line to exclude known capabilities from
> ever being gained in namespaces. Once set, it can be further restricted to
> exert dynamic policies on the system (e.g. ward off a potential exploit).
>
> Changing this mask requires privileges over CAP_SYS_ADMIN and CAP_SETPCAP
> in the initial user namespace.
>
> Example:
>
>     # sysctl -qw kernel.cap_userns_mask=0x1fffffdffff && \
>       unshare -r grep Cap /proc/self/status
>     CapInh: 0000000000000000
>     CapPrm: 000001fffffdffff
>     CapEff: 000001fffffdffff
>     CapBnd: 000001fffffdffff
>     CapAmb: 0000000000000000
>     CapUNs: 000001fffffdffff
>
> Signed-off-by: Jonathan Calmels <jcalmels@3xx0.net>

Reviewed-by: Serge Hallyn <serge@hallyn.com>

> ---
>  include/linux/user_namespace.h |  7 ++++
>  kernel/sysctl.c                | 10 ++++++
>  kernel/user_namespace.c        | 66 ++++++++++++++++++++++++++++++++++
>  3 files changed, 83 insertions(+)
>
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 6030a8235617..e3478bd54ee5 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -2,6 +2,7 @@
>  #ifndef _LINUX_USER_NAMESPACE_H
>  #define _LINUX_USER_NAMESPACE_H
>
> +#include <linux/capability.h>
>  #include <linux/kref.h>
>  #include <linux/nsproxy.h>
>  #include <linux/ns_common.h>
> @@ -14,6 +15,12 @@
>  #define UID_GID_MAP_MAX_BASE_EXTENTS 5
>  #define UID_GID_MAP_MAX_EXTENTS 340
>
> +#ifdef CONFIG_SYSCTL
> +extern kernel_cap_t cap_userns_mask;
> +int proc_cap_userns_handler(struct ctl_table *table, int write,
> +			    void *buffer, size_t *lenp, loff_t *ppos);
> +#endif
> +
>  struct uid_gid_extent {
>  	u32 first;
>  	u32 lower_first;
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index 81cc974913bb..1546eebd6aea 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -62,6 +62,7 @@
>  #include <linux/sched/sysctl.h>
>  #include <linux/mount.h>
>  #include <linux/userfaultfd_k.h>
> +#include <linux/user_namespace.h>
>  #include <linux/pid.h>
>
>  #include "../lib/kstrtox.h"
> @@ -1846,6 +1847,15 @@ static struct ctl_table kern_table[] = {
>  		.mode = 0444,
>  		.proc_handler = proc_dointvec,
>  	},
> +#ifdef CONFIG_USER_NS
> +	{
> +		.procname = "cap_userns_mask",
> +		.data = &cap_userns_mask,
> +		.maxlen = sizeof(kernel_cap_t),
> +		.mode = 0644,
> +		.proc_handler = proc_cap_userns_handler,
> +	},
> +#endif
>  #if defined(CONFIG_X86_LOCAL_APIC) && defined(CONFIG_X86)
>  	{
>  		.procname = "unknown_nmi_panic",
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index 53848e2b68cd..e0cf606e9140 100644
> --- a/kernel/user_namespace.c
> +++ b/kernel/user_namespace.c
> @@ -26,6 +26,66 @@
>  static struct kmem_cache *user_ns_cachep __ro_after_init;
>  static DEFINE_MUTEX(userns_state_mutex);
>
> +#ifdef CONFIG_SYSCTL
> +static DEFINE_SPINLOCK(cap_userns_lock);
> +kernel_cap_t cap_userns_mask = CAP_FULL_SET;
> +
> +int proc_cap_userns_handler(struct ctl_table *table, int write,
> +			    void *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	struct ctl_table t;
> +	unsigned long mask_array[2];
> +	kernel_cap_t new_mask, *mask;
> +	int err;
> +
> +	if (write && (!capable(CAP_SETPCAP) ||
> +		      !capable(CAP_SYS_ADMIN)))
> +		return -EPERM;
> +
> +	/*
> +	 * convert from the global kernel_cap_t to the ulong array to print to
> +	 * userspace if this is a read.
> +	 *
> +	 * capabilities are exposed as one 64-bit value or two 32-bit values
> +	 * depending on the architecture
> +	 */
> +	mask = table->data;
> +	spin_lock(&cap_userns_lock);
> +	mask_array[0] = (unsigned long) mask->val;
> +#if BITS_PER_LONG != 64
> +	mask_array[1] = mask->val >> BITS_PER_LONG;
> +#endif
> +	spin_unlock(&cap_userns_lock);
> +
> +	t = *table;
> +	t.data = &mask_array;
> +
> +	/*
> +	 * actually read or write an array of ulongs from userspace. Remember
> +	 * these are least significant bits first
> +	 */
> +	err = proc_doulongvec_minmax(&t, write, buffer, lenp, ppos);
> +	if (err < 0)
> +		return err;
> +
> +	new_mask.val = mask_array[0];
> +#if BITS_PER_LONG != 64
> +	new_mask.val += (u64)mask_array[1] << BITS_PER_LONG;
> +#endif
> +
> +	/*
> +	 * Drop everything not in the new_mask (but don't add things)
> +	 */
> +	if (write) {
> +		spin_lock(&cap_userns_lock);
> +		*mask = cap_intersect(*mask, new_mask);
> +		spin_unlock(&cap_userns_lock);
> +	}
> +
> +	return 0;
> +}
> +#endif
> +
>  static bool new_idmap_permitted(const struct file *file,
>  				struct user_namespace *ns, int cap_setid,
>  				struct uid_gid_map *map);
> @@ -46,6 +106,12 @@ static void set_cred_user_ns(struct cred *cred, struct user_namespace *user_ns)
>  	/* Limit userns capabilities to our parent's bounding set. */
>  	if (iscredsecure(cred, SECURE_USERNS_STRICT_CAPS))
>  		cred->cap_userns = cap_intersect(cred->cap_userns, cred->cap_bset);
> +#ifdef CONFIG_SYSCTL
> +	/* Mask off userns capabilities that are not permitted by the system-wide mask. */
> +	spin_lock(&cap_userns_lock);
> +	cred->cap_userns = cap_intersect(cred->cap_userns, cap_userns_mask);
> +	spin_unlock(&cap_userns_lock);
> +#endif
>
>  	/* Start with the capabilities defined in the userns set. */
>  	cred->cap_bset = cred->cap_userns;
> --
> 2.45.0
>

^ permalink raw reply [flat|nested] 53+ messages in thread
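For illustration only (a userspace sketch, not code from the patch): the changelog example writes 0x1fffffdffff, which is the full capability set with bit 17 cleared. Assuming the numbering in <linux/capability.h> for kernels of this era (CAP_SYS_RAWIO = 17, CAP_LAST_CAP = 40; both are assumptions, the patch does not state them), the mask arithmetic, the two-word split used when BITS_PER_LONG is 32, and the "intersect only" write semantics can be mimicked as below. All function and macro names here are hypothetical.

```c
#include <stdint.h>

/* Assumed values, taken from <linux/capability.h> around v6.9. */
#define CAP_SYS_RAWIO_NR 17
#define CAP_LAST_CAP_NR  40

/* Every valid capability bit set: bits 0..CAP_LAST_CAP_NR. */
static uint64_t cap_full_mask(void)
{
	return (UINT64_C(1) << (CAP_LAST_CAP_NR + 1)) - 1;
}

/* Clear a single capability bit from a mask. */
static uint64_t cap_mask_drop(uint64_t mask, unsigned int cap)
{
	return mask & ~(UINT64_C(1) << cap);
}

/* Split a 64-bit mask into two 32-bit words, least significant first,
 * mirroring what the handler does when BITS_PER_LONG != 64. */
static void cap_mask_split32(uint64_t mask, uint32_t words[2])
{
	words[0] = (uint32_t)mask;
	words[1] = (uint32_t)(mask >> 32);
}

/* Rebuild the 64-bit mask from the two words. */
static uint64_t cap_mask_join32(const uint32_t words[2])
{
	return (uint64_t)words[0] | ((uint64_t)words[1] << 32);
}

/* Write semantics of the sysctl: the new value is intersected with the
 * current one, so bits can be dropped but never added back. */
static uint64_t cap_mask_write(uint64_t current, uint64_t requested)
{
	return current & requested;
}
```

Under these assumptions, cap_mask_drop(cap_full_mask(), CAP_SYS_RAWIO_NR) yields the 0x1fffffdffff value from the example, and writing the full mask afterwards leaves it unchanged.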
* Re: [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-16 9:22 ` [PATCH 3/3] capabilities: add cap userns sysctl mask Jonathan Calmels 2024-05-16 12:44 ` Jarkko Sakkinen 2024-05-20 3:38 ` Serge E. Hallyn @ 2024-05-20 13:30 ` Tycho Andersen 2024-05-20 19:25 ` Jonathan Calmels 2 siblings, 1 reply; 53+ messages in thread From: Tycho Andersen @ 2024-05-20 13:30 UTC (permalink / raw To: Jonathan Calmels Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings Hi Jonathan, On Thu, May 16, 2024 at 02:22:05AM -0700, Jonathan Calmels wrote: > +int proc_cap_userns_handler(struct ctl_table *table, int write, > + void *buffer, size_t *lenp, loff_t *ppos) > +{ there is an ongoing effort (started at [0]) to constify the first arg here, since you're not supposed to write to it. Your usage looks correct to me, so I think all it needs is a literal "const" here. [0]: https://lore.kernel.org/lkml/20240423-sysctl-const-handler-v3-0-e0beccb836e2@weissschuh.net/ > + struct ctl_table t; > + unsigned long mask_array[2]; > + kernel_cap_t new_mask, *mask; > + int err; > + > + if (write && (!capable(CAP_SETPCAP) || > + !capable(CAP_SYS_ADMIN))) > + return -EPERM; ...why CAP_SYS_ADMIN? You mention it in the changelog, but don't explain why. Tycho ^ permalink raw reply [flat|nested] 53+ messages in thread
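For illustration only (a userspace sketch, not kernel code): the reason a plain "const" on the first argument should be enough is that the handler never modifies the table itself. It copies the table into a local, redirects the copy's data pointer at a scratch buffer, and only writes through the original data pointer, which a const struct does not forbid. The struct and function names below are hypothetical stand-ins for struct ctl_table and proc_doulongvec_minmax().

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct ctl_table. */
struct demo_table {
	void *data;
	size_t maxlen;
};

/* Hypothetical generic helper: stores a value through table->data.
 * The table is const; the object that data points to is not. */
static void demo_store_ulong(const struct demo_table *table, unsigned long v)
{
	*(unsigned long *)table->data = v;
}

/* Handler taking a const table, in the style of the patch: parse into a
 * scratch buffer via a local copy, then intersect into the real value. */
static unsigned long demo_handler(const struct demo_table *table,
				  unsigned long input)
{
	unsigned long scratch = 0;
	struct demo_table t = *table;        /* local copy, free to modify */
	unsigned long *global = table->data; /* reading the pointer is fine */

	t.data = &scratch;                   /* only the copy is redirected */
	t.maxlen = sizeof(scratch);
	demo_store_ulong(&t, input);

	*global &= scratch;                  /* drop bits, never add them */
	return *global;
}
```

The original table's fields are left untouched throughout, which is what the constification effort relies on.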
* Re: [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-20 13:30 ` Tycho Andersen @ 2024-05-20 19:25 ` Jonathan Calmels 2024-05-20 21:13 ` Tycho Andersen 0 siblings, 1 reply; 53+ messages in thread From: Jonathan Calmels @ 2024-05-20 19:25 UTC (permalink / raw To: Tycho Andersen Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Mon, May 20, 2024 at 07:30:14AM GMT, Tycho Andersen wrote: > there is an ongoing effort (started at [0]) to constify the first arg > here, since you're not supposed to write to it. Your usage looks > correct to me, so I think all it needs is a literal "const" here. Will do, along with the suggestions from Jarkko > > + struct ctl_table t; > > + unsigned long mask_array[2]; > > + kernel_cap_t new_mask, *mask; > > + int err; > > + > > + if (write && (!capable(CAP_SETPCAP) || > > + !capable(CAP_SYS_ADMIN))) > > + return -EPERM; > > ...why CAP_SYS_ADMIN? You mention it in the changelog, but don't > explain why. No reason really, I was hoping we could decide what we want here. UMH uses CAP_SYS_MODULE, Serge mentioned adding a new cap maybe. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-20 19:25 ` Jonathan Calmels @ 2024-05-20 21:13 ` Tycho Andersen 2024-05-20 22:12 ` Jarkko Sakkinen 0 siblings, 1 reply; 53+ messages in thread From: Tycho Andersen @ 2024-05-20 21:13 UTC (permalink / raw To: Jonathan Calmels Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Mon, May 20, 2024 at 12:25:27PM -0700, Jonathan Calmels wrote: > On Mon, May 20, 2024 at 07:30:14AM GMT, Tycho Andersen wrote: > > there is an ongoing effort (started at [0]) to constify the first arg > > here, since you're not supposed to write to it. Your usage looks > > correct to me, so I think all it needs is a literal "const" here. > > Will do, along with the suggestions from Jarkko > > > > + struct ctl_table t; > > > + unsigned long mask_array[2]; > > > + kernel_cap_t new_mask, *mask; > > > + int err; > > > + > > > + if (write && (!capable(CAP_SETPCAP) || > > > + !capable(CAP_SYS_ADMIN))) > > > + return -EPERM; > > > > ...why CAP_SYS_ADMIN? You mention it in the changelog, but don't > > explain why. > > No reason really, I was hoping we could decide what we want here. > UMH uses CAP_SYS_MODULE, Serge mentioned adding a new cap maybe. I don't have a strong preference between SETPCAP and a new capability, but I do think it should be just one. SYS_ADMIN is already god mode enough, IMO. Tycho ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-20 21:13 ` Tycho Andersen @ 2024-05-20 22:12 ` Jarkko Sakkinen 2024-05-21 14:29 ` Tycho Andersen 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-20 22:12 UTC (permalink / raw To: Tycho Andersen, Jonathan Calmels Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Tue May 21, 2024 at 12:13 AM EEST, Tycho Andersen wrote: > On Mon, May 20, 2024 at 12:25:27PM -0700, Jonathan Calmels wrote: > > On Mon, May 20, 2024 at 07:30:14AM GMT, Tycho Andersen wrote: > > > there is an ongoing effort (started at [0]) to constify the first arg > > > here, since you're not supposed to write to it. Your usage looks > > > correct to me, so I think all it needs is a literal "const" here. > > > > Will do, along with the suggestions from Jarkko > > > > > > + struct ctl_table t; > > > > + unsigned long mask_array[2]; > > > > + kernel_cap_t new_mask, *mask; > > > > + int err; > > > > + > > > > + if (write && (!capable(CAP_SETPCAP) || > > > > + !capable(CAP_SYS_ADMIN))) > > > > + return -EPERM; > > > > > > ...why CAP_SYS_ADMIN? You mention it in the changelog, but don't > > > explain why. > > > > No reason really, I was hoping we could decide what we want here. > > UMH uses CAP_SYS_MODULE, Serge mentioned adding a new cap maybe. > > I don't have a strong preference between SETPCAP and a new capability, > but I do think it should be just one. SYS_ADMIN is already god mode > enough, IMO. Sometimes I wonder whether it would make more sense to invent something completely new, like capabilities but more modern and robust, instead of increasing the complexity of a broken mechanism (especially thanks to CAP_MAC_ADMIN). I kind of liked the idea of privilege tokens both in Symbian and Maemo (I have been involved professionally with both). 
Emphasis on the idea, not necessarily on the implementation. Not an LSM, but something that you could use in place of POSIX caps. Probably quite a tedious effort though, because you would need to pull the whole industry along with the new thing... BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-20 22:12 ` Jarkko Sakkinen @ 2024-05-21 14:29 ` Tycho Andersen 2024-05-21 14:45 ` Jarkko Sakkinen 0 siblings, 1 reply; 53+ messages in thread From: Tycho Andersen @ 2024-05-21 14:29 UTC (permalink / raw To: Jarkko Sakkinen Cc: Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Tue, May 21, 2024 at 01:12:57AM +0300, Jarkko Sakkinen wrote: > On Tue May 21, 2024 at 12:13 AM EEST, Tycho Andersen wrote: > > On Mon, May 20, 2024 at 12:25:27PM -0700, Jonathan Calmels wrote: > > > On Mon, May 20, 2024 at 07:30:14AM GMT, Tycho Andersen wrote: > > > > there is an ongoing effort (started at [0]) to constify the first arg > > > > here, since you're not supposed to write to it. Your usage looks > > > > correct to me, so I think all it needs is a literal "const" here. > > > > > > Will do, along with the suggestions from Jarkko > > > > > > > > + struct ctl_table t; > > > > > + unsigned long mask_array[2]; > > > > > + kernel_cap_t new_mask, *mask; > > > > > + int err; > > > > > + > > > > > + if (write && (!capable(CAP_SETPCAP) || > > > > > + !capable(CAP_SYS_ADMIN))) > > > > > + return -EPERM; > > > > > > > > ...why CAP_SYS_ADMIN? You mention it in the changelog, but don't > > > > explain why. > > > > > > No reason really, I was hoping we could decide what we want here. > > > UMH uses CAP_SYS_MODULE, Serge mentioned adding a new cap maybe. > > > > I don't have a strong preference between SETPCAP and a new capability, > > but I do think it should be just one. SYS_ADMIN is already god mode > > enough, IMO. > > Sometimes I think would it make more sense to invent something > completely new like capabilities but more modern and robust, instead of > increasing complexity of a broken mechanism (especially thanks to > CAP_MAC_ADMIN). 
> > I kind of liked the idea of privilege tokens both in Symbian and Maemo > (have been involved professionally in both). Emphasis on the idea not > necessarily on implementation. > > Not an LSM but like something that you could use in the place of POSIX > caps. Probably quite tedious effort tho because you would need to pull > the whole industry with the new thing... And then we have LSM hooks, (ns_)capable(), __secure_computing() plus a new set of hooks for this new thing sprinkled around. I guess kernel developers wouldn't be excited about it, let alone the rest of the industry :) Thinking out loud: I wonder if fixing the seccomp TOCTOU against pointers would help here. I guess you'd still have issues where your policy engine resolves a path arg to open() and that inode changes between the decision and the actual vfs access, you have just changed the TOCTOU. Or even scarier: what if you could change the return value at any kprobe? :) Tycho ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 3/3] capabilities: add cap userns sysctl mask 2024-05-21 14:29 ` Tycho Andersen @ 2024-05-21 14:45 ` Jarkko Sakkinen 0 siblings, 0 replies; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-21 14:45 UTC (permalink / raw To: Tycho Andersen Cc: Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Tue May 21, 2024 at 5:29 PM EEST, Tycho Andersen wrote: > On Tue, May 21, 2024 at 01:12:57AM +0300, Jarkko Sakkinen wrote: > > On Tue May 21, 2024 at 12:13 AM EEST, Tycho Andersen wrote: > > > On Mon, May 20, 2024 at 12:25:27PM -0700, Jonathan Calmels wrote: > > > > On Mon, May 20, 2024 at 07:30:14AM GMT, Tycho Andersen wrote: > > > > > there is an ongoing effort (started at [0]) to constify the first arg > > > > > here, since you're not supposed to write to it. Your usage looks > > > > > correct to me, so I think all it needs is a literal "const" here. > > > > > > > > Will do, along with the suggestions from Jarkko > > > > > > > > > > + struct ctl_table t; > > > > > > + unsigned long mask_array[2]; > > > > > > + kernel_cap_t new_mask, *mask; > > > > > > + int err; > > > > > > + > > > > > > + if (write && (!capable(CAP_SETPCAP) || > > > > > > + !capable(CAP_SYS_ADMIN))) > > > > > > + return -EPERM; > > > > > > > > > > ...why CAP_SYS_ADMIN? You mention it in the changelog, but don't > > > > > explain why. > > > > > > > > No reason really, I was hoping we could decide what we want here. > > > > UMH uses CAP_SYS_MODULE, Serge mentioned adding a new cap maybe. > > > > > > I don't have a strong preference between SETPCAP and a new capability, > > > but I do think it should be just one. SYS_ADMIN is already god mode > > > enough, IMO. 
> > > > Sometimes I think would it make more sense to invent something > > completely new like capabilities but more modern and robust, instead of > > increasing complexity of a broken mechanism (especially thanks to > > CAP_MAC_ADMIN). > > > > I kind of liked the idea of privilege tokens both in Symbian and Maemo > > (have been involved professionally in both). Emphasis on the idea not > > necessarily on implementation. > > > > Not an LSM but like something that you could use in the place of POSIX > > caps. Probably quite tedious effort tho because you would need to pull > > the whole industry with the new thing... > > And then we have LSM hooks, (ns_)capable(), __secure_computing() plus > a new set of hooks for this new thing sprinkled around. I guess > kernel developers wouldn't be excited about it, let alone the rest of > the industry :) > > Thinking out loud: I wonder if fixing the seccomp TOCTOU against > pointers would help here. I guess you'd still have issues where your > policy engine resolves a path arg to open() and that inode changes > between the decision and the actual vfs access, you have just changed > the TOCTOU. > > Or even scarier: what if you could change the return value at any > kprobe? :) I had one crazy idea related to seccomp filters once. What if there was a way to compose tokens that would each be just a seccomp filter, like the one that you pass to PR_SET_SECCOMP, but represented by a file descriptor? Then you could send these with SCM_RIGHTS to other processes and they could upgrade their existing filter with them. So it would be a kind of extension mechanism for a seccomp filter. Not something I'm seriously suggesting, but I thought I'd flush this out now that we are on these topics anyhow ;-) > Tycho PS. Sorry if my language was a bit harsh earlier, but I think I also had a point, at least regarding the patch set presentation. I.e. 
you are very precise in describing the mechanism, but the motivation and putting the topic into context are equally important :-) BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 9:22 [PATCH 0/3] Introduce user namespace capabilities Jonathan Calmels ` (2 preceding siblings ...) 2024-05-16 9:22 ` [PATCH 3/3] capabilities: add cap userns sysctl mask Jonathan Calmels @ 2024-05-16 13:30 ` Ben Boeckel 2024-05-16 13:36 ` Jarkko Sakkinen 2024-05-16 16:23 ` Paul Moore 2024-05-16 19:07 ` Casey Schaufler 5 siblings, 1 reply; 53+ messages in thread From: Ben Boeckel @ 2024-05-16 13:30 UTC (permalink / raw To: Jonathan Calmels Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu, May 16, 2024 at 02:22:02 -0700, Jonathan Calmels wrote: > Jonathan Calmels (3): > capabilities: user namespace capabilities > capabilities: add securebit for strict userns caps > capabilities: add cap userns sysctl mask > > fs/proc/array.c | 9 ++++ > include/linux/cred.h | 3 ++ > include/linux/securebits.h | 1 + > include/linux/user_namespace.h | 7 +++ > include/uapi/linux/prctl.h | 7 +++ > include/uapi/linux/securebits.h | 11 ++++- > kernel/cred.c | 3 ++ > kernel/sysctl.c | 10 ++++ > kernel/umh.c | 16 +++++++ > kernel/user_namespace.c | 83 ++++++++++++++++++++++++++++++--- > security/commoncap.c | 59 +++++++++++++++++++++++ > security/keys/process_keys.c | 3 ++ > 12 files changed, 204 insertions(+), 8 deletions(-) I note a lack of any changes to `Documentation/` which seems quite glaring for something with such a userspace visibility aspect to it. --Ben ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 13:30 ` [PATCH 0/3] Introduce user namespace capabilities Ben Boeckel @ 2024-05-16 13:36 ` Jarkko Sakkinen 2024-05-17 10:00 ` Jonathan Calmels 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-16 13:36 UTC (permalink / raw To: Ben Boeckel, Jonathan Calmels Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu May 16, 2024 at 4:30 PM EEST, Ben Boeckel wrote: > On Thu, May 16, 2024 at 02:22:02 -0700, Jonathan Calmels wrote: > > Jonathan Calmels (3): > > capabilities: user namespace capabilities > > capabilities: add securebit for strict userns caps > > capabilities: add cap userns sysctl mask > > > > fs/proc/array.c | 9 ++++ > > include/linux/cred.h | 3 ++ > > include/linux/securebits.h | 1 + > > include/linux/user_namespace.h | 7 +++ > > include/uapi/linux/prctl.h | 7 +++ > > include/uapi/linux/securebits.h | 11 ++++- > > kernel/cred.c | 3 ++ > > kernel/sysctl.c | 10 ++++ > > kernel/umh.c | 16 +++++++ > > kernel/user_namespace.c | 83 ++++++++++++++++++++++++++++++--- > > security/commoncap.c | 59 +++++++++++++++++++++++ > > security/keys/process_keys.c | 3 ++ > > 12 files changed, 204 insertions(+), 8 deletions(-) > > I note a lack of any changes to `Documentation/` which seems quite > glaring for something with such a userspace visibility aspect to it. > > --Ben Yeah, also in the cover letter it would be nice to refresh what a bounding set is. I had to xref that (then recalled what it is), and then got bored reading the rest :-) Not exactly an "in a nutshell" cover letter tbh, but maybe its content would be better placed in Documentation/ BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 13:36 ` Jarkko Sakkinen @ 2024-05-17 10:00 ` Jonathan Calmels 0 siblings, 0 replies; 53+ messages in thread From: Jonathan Calmels @ 2024-05-17 10:00 UTC (permalink / raw To: Jarkko Sakkinen Cc: Ben Boeckel, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu, May 16, 2024 at 04:36:07PM GMT, Jarkko Sakkinen wrote: > On Thu May 16, 2024 at 4:30 PM EEST, Ben Boeckel wrote: > > I note a lack of any changes to `Documentation/` which seems quite > > glaring for something with such a userspace visibility aspect to it. > > > > --Ben > > Yeah, also in cover letter it would be nice to refresh what is > a bounding set. I had to xref that (recalled what it is), and > then got bored reading the rest :-) Thanks for reminding me, I actually meant to do it, just forgot. Having said that, `Documentation/security/credentials.rst` is not the best documentation when it comes to capabilities. I will definitely add a few more lines in there, but it's probably not what you're looking for. capabilities(7) is where everything is explained; I should have mentioned it. I could try to summarize the existing sets, but honestly I will probably do a worse job than the man page. I do plan to update the man page though, if it comes to that. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 9:22 [PATCH 0/3] Introduce user namespace capabilities Jonathan Calmels ` (3 preceding siblings ...) 2024-05-16 13:30 ` [PATCH 0/3] Introduce user namespace capabilities Ben Boeckel @ 2024-05-16 16:23 ` Paul Moore 2024-05-16 17:18 ` Jarkko Sakkinen 2024-05-16 19:07 ` Casey Schaufler 5 siblings, 1 reply; 53+ messages in thread From: Paul Moore @ 2024-05-16 16:23 UTC (permalink / raw To: Jonathan Calmels, Serge Hallyn Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, James Morris, David Howells, Jarkko Sakkinen, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu, May 16, 2024 at 5:21 AM Jonathan Calmels <jcalmels@3xx0.net> wrote: > > It's that time of the year again where we debate security settings for user > namespaces ;) > > I’ve been experimenting with different approaches to address the gripe > around user namespaces being used as attack vectors. > After invaluable feedback from Serge and Christian offline, this is what I > came up with. As Serge is the capabilities maintainer it would be good to hear his thoughts on-list about this proposal. > There are obviously a lot of things we could do differently but I feel this > is the right balance between functionality, simplicity and security. This > also serves as a good foundation and could always be extended if the need > arises in the future. > > Notes: > > - Adding a new capability set is far from ideal, but trying to reuse the > existing capability framework was deemed both impractical and > questionable security-wise, so here we are. > > - We might want to add new capabilities for some of the checks instead of > reusing CAP_SETPCAP every time. Serge mentioned something like > CAP_SYS_LIMIT? 
> > - In the last patch, we could decide to have stronger requirements and > perform checks inside cap_capable() in case we want to retroactively > prevent capabilities in old namespaces, this might be an overreach though > so I left it out. > > I'm also not fond of the ulong logic for setting the sysctl parameter, on > the other hand, the usermodhelper code always uses two u32s which makes it > very confusing to set in userspace. > > > Jonathan Calmels (3): > capabilities: user namespace capabilities > capabilities: add securebit for strict userns caps > capabilities: add cap userns sysctl mask > > fs/proc/array.c | 9 ++++ > include/linux/cred.h | 3 ++ > include/linux/securebits.h | 1 + > include/linux/user_namespace.h | 7 +++ > include/uapi/linux/prctl.h | 7 +++ > include/uapi/linux/securebits.h | 11 ++++- > kernel/cred.c | 3 ++ > kernel/sysctl.c | 10 ++++ > kernel/umh.c | 16 +++++++ > kernel/user_namespace.c | 83 ++++++++++++++++++++++++++++++--- > security/commoncap.c | 59 +++++++++++++++++++++++ > security/keys/process_keys.c | 3 ++ > 12 files changed, 204 insertions(+), 8 deletions(-) -- paul-moore.com ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 16:23 ` Paul Moore @ 2024-05-16 17:18 ` Jarkko Sakkinen 0 siblings, 0 replies; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-16 17:18 UTC (permalink / raw To: Paul Moore, Jonathan Calmels, Serge Hallyn Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu May 16, 2024 at 7:23 PM EEST, Paul Moore wrote: > On Thu, May 16, 2024 at 5:21 AM Jonathan Calmels <jcalmels@3xx0.net> wrote: > > > > It's that time of the year again where we debate security settings for user > > namespaces ;) > > > > I’ve been experimenting with different approaches to address the gripe > > around user namespaces being used as attack vectors. > > After invaluable feedback from Serge and Christian offline, this is what I > > came up with. > > As Serge is the capabilities maintainer it would be good to hear his > thoughts on-list about this proposal. Also, it would make sense to make this a bit more digestible to a wider group of maintainers, i.e. a better introduction to the topic instead of a huge list of references (no bandwidth to read them all). This is exactly the kind of patch set that you end up ignoring unless you are already active in exactly this domain. I think a better introduction could bring in more genuinely useful feedback. BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 9:22 [PATCH 0/3] Introduce user namespace capabilities Jonathan Calmels ` (4 preceding siblings ...) 2024-05-16 16:23 ` Paul Moore @ 2024-05-16 19:07 ` Casey Schaufler 2024-05-16 19:29 ` Jarkko Sakkinen 5 siblings, 1 reply; 53+ messages in thread From: Casey Schaufler @ 2024-05-16 19:07 UTC (permalink / raw To: Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, Jarkko Sakkinen, Casey Schaufler Cc: containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On 5/16/2024 2:22 AM, Jonathan Calmels wrote: > It's that time of the year again where we debate security settings for user > namespaces ;) > > I’ve been experimenting with different approaches to address the gripe > around user namespaces being used as attack vectors. > After invaluable feedback from Serge and Christian offline, this is what I > came up with. > > There are obviously a lot of things we could do differently but I feel this > is the right balance between functionality, simplicity and security. This > also serves as a good foundation and could always be extended if the need > arises in the future. > > Notes: > > - Adding a new capability set is far from ideal, but trying to reuse the > existing capability framework was deemed both impractical and > questionable security-wise, so here we are. I suggest that adding a capability set for user namespaces is a bad idea: - It is in no way obvious what problem it solves - It is not obvious how it solves any problem - The capability mechanism has not been popular, and relying on a community (e.g. container developers) to embrace it based on this enhancement is a recipe for failure - Capabilities are already more complicated than modern developers want to deal with. Adding another, special purpose set, is going to make them even more difficult to use. 
> - We might want to add new capabilities for some of the checks instead of > reusing CAP_SETPCAP every time. Serge mentioned something like > CAP_SYS_LIMIT? > > - In the last patch, we could decide to have stronger requirements and > perform checks inside cap_capable() in case we want to retroactively > prevent capabilities in old namespaces, this might be an overreach though > so I left it out. > > I'm also not fond of the ulong logic for setting the sysctl parameter, on > the other hand, the usermodhelper code always uses two u32s which makes it > very confusing to set in userspace. > > > Jonathan Calmels (3): > capabilities: user namespace capabilities > capabilities: add securebit for strict userns caps > capabilities: add cap userns sysctl mask > > fs/proc/array.c | 9 ++++ > include/linux/cred.h | 3 ++ > include/linux/securebits.h | 1 + > include/linux/user_namespace.h | 7 +++ > include/uapi/linux/prctl.h | 7 +++ > include/uapi/linux/securebits.h | 11 ++++- > kernel/cred.c | 3 ++ > kernel/sysctl.c | 10 ++++ > kernel/umh.c | 16 +++++++ > kernel/user_namespace.c | 83 ++++++++++++++++++++++++++++++--- > security/commoncap.c | 59 +++++++++++++++++++++++ > security/keys/process_keys.c | 3 ++ > 12 files changed, 204 insertions(+), 8 deletions(-) > ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 19:07 ` Casey Schaufler @ 2024-05-16 19:29 ` Jarkko Sakkinen 2024-05-16 19:31 ` Jarkko Sakkinen 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-16 19:29 UTC (permalink / raw To: Casey Schaufler, Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells Cc: containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote: > I suggest that adding a capability set for user namespaces is a bad idea: > - It is in no way obvious what problem it solves > - It is not obvious how it solves any problem > - The capability mechanism has not been popular, and relying on a > community (e.g. container developers) to embrace it based on this > enhancement is a recipe for failure > - Capabilities are already more complicated than modern developers > want to deal with. Adding another, special purpose set, is going > to make them even more difficult to use. What, Inh, Prm, Eff, Bnd and Amb are not dead obvious to you? ;-) One more, UNs, cannot hurt... I'm not following containers that much, but weren't seccomp profiles supposed to be the silver bullet? BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 19:29 ` Jarkko Sakkinen @ 2024-05-16 19:31 ` Jarkko Sakkinen 2024-05-16 20:00 ` Jarkko Sakkinen 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-16 19:31 UTC (permalink / raw To: Jarkko Sakkinen, Casey Schaufler, Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells Cc: containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu May 16, 2024 at 10:29 PM EEST, Jarkko Sakkinen wrote: > On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote: > > I suggest that adding a capability set for user namespaces is a bad idea: > > - It is in no way obvious what problem it solves > > - It is not obvious how it solves any problem > > - The capability mechanism has not been popular, and relying on a > > community (e.g. container developers) to embrace it based on this > > enhancement is a recipe for failure > > - Capabilities are already more complicated than modern developers > > want to deal with. Adding another, special purpose set, is going > > to make them even more difficult to use. > > What Inh, Prm, Eff, Bnd and Amb is not dead obvious to you? ;-) > One UNs cannot hurt... > > I'm not following containers that much but didn't seccomp profiles > supposed to be the silver bullet? Also, I think the Kata Containers style of doing containers is pretty solid. I've heard that some video streaming service, at least in the recent past, launched a VM per stream, so it's not like VMs cannot be made to scale, I guess. BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 19:31 ` Jarkko Sakkinen @ 2024-05-16 20:00 ` Jarkko Sakkinen 2024-05-17 11:42 ` Jonathan Calmels 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-16 20:00 UTC (permalink / raw To: Jarkko Sakkinen, Casey Schaufler, Jonathan Calmels, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells Cc: containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Thu May 16, 2024 at 10:31 PM EEST, Jarkko Sakkinen wrote: > On Thu May 16, 2024 at 10:29 PM EEST, Jarkko Sakkinen wrote: > > On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote: > > > I suggest that adding a capability set for user namespaces is a bad idea: > > > - It is in no way obvious what problem it solves > > > - It is not obvious how it solves any problem > > > - The capability mechanism has not been popular, and relying on a > > > community (e.g. container developers) to embrace it based on this > > > enhancement is a recipe for failure > > > - Capabilities are already more complicated than modern developers > > > want to deal with. Adding another, special purpose set, is going > > > to make them even more difficult to use. > > > > What Inh, Prm, Eff, Bnd and Amb is not dead obvious to you? ;-) > > One UNs cannot hurt... > > > > I'm not following containers that much but didn't seccomp profiles > > supposed to be the silver bullet? > > Also, I think Kata Containers style way of doing containers is pretty > solid. I've heard that some video streaming service at least in recent > past did launch VM per stream so it's not like VM's cannot be made to > scale I guess. Sorry for multiple responses but this actually nails the key question: who will use this? Even if this would work out somehow, is there someone who will actually use this, and not few other more robust solutions available? 
I mean, is it worth the time to maintain it if there are no potential users for the feature? In addition to "show me the code", there is always also "show me the payload".

BR, Jarkko

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-16 20:00 ` Jarkko Sakkinen @ 2024-05-17 11:42 ` Jonathan Calmels 2024-05-17 17:53 ` Casey Schaufler 0 siblings, 1 reply; 53+ messages in thread From: Jonathan Calmels @ 2024-05-17 11:42 UTC (permalink / raw To: Jarkko Sakkinen Cc: Casey Schaufler, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings

> > > On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote:
> > > > I suggest that adding a capability set for user namespaces is a bad idea:
> > > > - It is in no way obvious what problem it solves
> > > > - It is not obvious how it solves any problem
> > > > - The capability mechanism has not been popular, and relying on a
> > > > community (e.g. container developers) to embrace it based on this
> > > > enhancement is a recipe for failure
> > > > - Capabilities are already more complicated than modern developers
> > > > want to deal with. Adding another, special purpose set, is going
> > > > to make them even more difficult to use.

Sorry if the commit wasn't clear enough. Basically:

- Today user namespaces grant full capabilities.
  This behavior is often abused to attack various kernel subsystems.
  The only option is to disable them altogether, which breaks a lot of
  userspace stuff.
  This goes against the least privilege principle.

- It adds a new capability set.
  This set dictates what capabilities are granted in namespaces (instead
  of always getting full caps).
  This brings namespaces in line with the rest of the system, user
  namespaces are no longer "special".
  They now work the same way as say a transition to root does with
  inheritable caps.

- This isn't intended to be used by end users per se (although they could).
  This would be used at the same places where existing capabilities are
  used today (e.g.
init system, pam, container runtime, browser sandbox), or by system administrators.

To give you some ideas of things you could do:

# E.g. prevent alice from getting CAP_NET_ADMIN in user namespaces under SSH
echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
echo '!cap_net_admin alice' >> /etc/security/capability.conf

# E.g. prevent any Docker container from ever getting CAP_DAC_OVERRIDE
systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
            -p SecureBits=userns-strict-caps \
            /usr/bin/dockerd

# E.g. kernel could be vulnerable to CAP_SYS_RAWIO exploits
# Prevent users from ever gaining it
sysctl -w cap_bound_userns_mask=0x1fffffdffff

^ permalink raw reply [flat|nested] 53+ messages in thread
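The mask in the last example can be derived rather than typed by hand. A minimal Python sketch, assuming the sysctl takes a plain bitmask indexed by capability number (which the example value is consistent with) and that the kernel defines capabilities 0 through 40. The helper name userns_mask_without is made up for illustration.

```python
# Capability number from include/uapi/linux/capability.h.
CAP_SYS_RAWIO = 17
# Assumption: the running kernel defines capabilities 0..40 (CAP_LAST_CAP == 40).
CAP_LAST_CAP = 40


def userns_mask_without(*dropped, last_cap=CAP_LAST_CAP):
    """Return a mask with every capability bit set except the dropped ones."""
    mask = (1 << (last_cap + 1)) - 1
    for cap in dropped:
        mask &= ~(1 << cap)
    return mask


print(hex(userns_mask_without(CAP_SYS_RAWIO)))  # 0x1fffffdffff
```

On a kernel with a different CAP_LAST_CAP the full mask changes, so the value should be recomputed rather than copied.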
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-17 11:42 ` Jonathan Calmels @ 2024-05-17 17:53 ` Casey Schaufler 2024-05-17 19:11 ` Jonathan Calmels 2024-05-18 12:20 ` Serge Hallyn 0 siblings, 2 replies; 53+ messages in thread From: Casey Schaufler @ 2024-05-17 17:53 UTC (permalink / raw To: Jonathan Calmels, Jarkko Sakkinen Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings, Casey Schaufler On 5/17/2024 4:42 AM, Jonathan Calmels wrote: >>>> On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote: >>>>> I suggest that adding a capability set for user namespaces is a bad idea: >>>>> - It is in no way obvious what problem it solves >>>>> - It is not obvious how it solves any problem >>>>> - The capability mechanism has not been popular, and relying on a >>>>> community (e.g. container developers) to embrace it based on this >>>>> enhancement is a recipe for failure >>>>> - Capabilities are already more complicated than modern developers >>>>> want to deal with. Adding another, special purpose set, is going >>>>> to make them even more difficult to use. > Sorry if the commit wasn't clear enough. While, as others have pointed out, the commit description left much to be desired, that isn't the biggest problem with the change you're proposing. > Basically: > > - Today user namespaces grant full capabilities. Of course they do. I have been following the use of capabilities in Linux since before they were implemented. The uptake has been disappointing in all use cases. > This behavior is often abused to attack various kernel subsystems. Yes. The problems of a single, all powerful root privilege scheme are well documented. > Only option Hardly. > is to disable them altogether which breaks a lot of > userspace stuff. 
Updating userspace components to behave properly in a capabilities environment has never been a popular activity, but is the right way to address this issue. And before you start on the "no one can do that, it's too hard", I'll point out that multiple UNIX systems supported rootless, all capabilities based systems back in the day. > This goes against the least privilege principle. If you're going to run userspace that *requires* privilege, you have to have a way to *allow* privilege. If the userspace insists on a root based privilege model, you're stuck supporting it. Regardless of your principles. > > - It adds a new capability set. Which is a really, really bad idea. The equation for calculating effective privilege is already more complicated than userspace developers are generally willing to put up with. > This set dictates what capabilities are granted in namespaces (instead > of always getting full caps). I would not expect container developers to be eager to learn how to use this facility. > This brings namespaces in line with the rest of the system, user > namespaces are no more "special". I'm sorry, but this makes no sense to me whatsoever. You want to introduce a capability set explicitly for namespaces in order to make them less special? Maybe I'm just old and cranky. > They now work the same way as say a transition to root does with > inheritable caps. That needs some explanation. > > - This isn't intended to be used by end users per se (although they could). > This would be used at the same places where existing capabalities are > used today (e.g. init system, pam, container runtime, browser > sandbox), or by system administrators. I understand that. It is for containers. Containers are not kernel entities. > > To give you some ideas of things you could do: > > # E.g. prevent alice from getting CAP_NET_ADMIN in user namespaces under SSH > echo "auth optional pam_cap.so" >> /etc/pam.d/sshd > echo "!cap_net_admin alice" >> /etc/security/capability.conf. 
>
> # E.g. prevent any Docker container from ever getting CAP_DAC_OVERRIDE
> systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
>             -p SecureBits=userns-strict-caps \
>             /usr/bin/dockerd
>
> # E.g. kernel could be vulnerable to CAP_SYS_RAWIO exploits
> # Prevent users from ever gaining it
> sysctl -w cap_bound_userns_mask=0x1fffffdffff

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-17 17:53 ` Casey Schaufler @ 2024-05-17 19:11 ` Jonathan Calmels 2024-05-18 11:08 ` Jarkko Sakkinen 2024-05-18 12:20 ` Serge Hallyn 1 sibling, 1 reply; 53+ messages in thread From: Jonathan Calmels @ 2024-05-17 19:11 UTC (permalink / raw To: Casey Schaufler Cc: Jarkko Sakkinen, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings

On Fri, May 17, 2024 at 10:53:24AM GMT, Casey Schaufler wrote:
> Of course they do. I have been following the use of capabilities
> in Linux since before they were implemented. The uptake has been
> disappointing in all use cases.

Why "Of course"?
What if they should not get *all* privileges?

> Yes. The problems of a single, all powerful root privilege scheme are
> well documented.

That's my point, it doesn't have to be this way.

> Hardly.

Maybe I'm missing something, then. How do I restrict my users from gaining say CAP_NET_ADMIN in their userns today?

> If you're going to run userspace that *requires* privilege, you have
> to have a way to *allow* privilege. If the userspace insists on a root
> based privilege model, you're stuck supporting it. Regardless of your
> principles.

I want *some* privileges, not *all* of them.

> Which is a really, really bad idea. The equation for calculating effective
> privilege is already more complicated than userspace developers are generally
> willing to put up with.

This is generally true, but this set is way more straightforward than the other sets, it's always:

    pU = pP = pE = X

If you look at the patch, there is no transition logic or anything complicated, it's just a set of caps being inherited.

> I would not expect container developers to be eager to learn how to use
> this facility.

And they probably wouldn't. For most use cases it's going to be enforced through system policies (init, pam, etc).
Other than that, usage won't change, you will run your usual `docker run --cap-add ...` to get caps, except now it works in userns.

> I'm sorry, but this makes no sense to me whatsoever. You want to introduce
> a capability set explicitly for namespaces in order to make them less
> special? Maybe I'm just old and cranky.
>
> > They now work the same way as say a transition to root does with
> > inheritable caps.
>
> That needs some explanation.

From man capabilities(7):

  In order to mirror traditional UNIX semantics, the kernel performs
  special treatment of file capabilities when a process with UID 0
  (root) executes a program [...]

  Thus, when [...] a process whose real and effective UIDs are zero
  execve(2)s a program, the calculation of the process's new permitted
  capabilities simplifies to:

      P'(permitted) = P(inheritable) | P(bounding)

      P'(effective) = P'(permitted)

So, the same way a root process is bounded by its inheritable set when it
execs, a "rootless" process is bounded by its userns set when it unshares.

^ permalink raw reply [flat|nested] 53+ messages in thread
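The analogy can be written out with capability sets as plain bitmasks. This Python sketch is a toy model, not kernel code: root_exec_permitted follows the capabilities(7) formula quoted in this message, while userns_creation_caps encodes the rule as it is described in this thread (pU = pP = pE), which is an assumption about the patch rather than something taken from its code. Both function names are hypothetical, and the 41-bit full mask assumes capabilities 0 through 40.

```python
CAP_NET_ADMIN = 12     # from include/uapi/linux/capability.h
FULL = (1 << 41) - 1   # assumption: capabilities 0..40 are defined


def root_exec_permitted(inheritable, bounding):
    """capabilities(7), UID 0 execve: P'(permitted) = P(inheritable) | P(bounding)."""
    return inheritable | bounding


def userns_creation_caps(userns_set):
    """Rule as described in the thread: the new namespace gets pP = pE = pU."""
    return {"permitted": userns_set, "effective": userns_set}


# Current behaviour per the thread: a new user namespace grants full capabilities.
today = userns_creation_caps(FULL)

# With CAP_NET_ADMIN dropped from the userns set beforehand, "root" inside
# the child namespace would not hold it.
restricted = userns_creation_caps(FULL & ~(1 << CAP_NET_ADMIN))
```

The point of the model is only that the userns rule has no transition logic: the result is whatever mask the process carried in.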
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-17 19:11 ` Jonathan Calmels @ 2024-05-18 11:08 ` Jarkko Sakkinen 2024-05-18 11:17 ` Jarkko Sakkinen 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-18 11:08 UTC (permalink / raw To: Jonathan Calmels, Casey Schaufler Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Fri May 17, 2024 at 10:11 PM EEST, Jonathan Calmels wrote: > On Fri, May 17, 2024 at 10:53:24AM GMT, Casey Schaufler wrote: > > Of course they do. I have been following the use of capabilities > > in Linux since before they were implemented. The uptake has been > > disappointing in all use cases. > > Why "Of course"? > What if they should not get *all* privileges? They do the job given a real-world workload and stress test. Here the problem is based on a theory and an experiment. Even a formal model does not necessarily map all "unknown unknowns". BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-18 11:08 ` Jarkko Sakkinen @ 2024-05-18 11:17 ` Jarkko Sakkinen 2024-05-18 11:21 ` Jarkko Sakkinen 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-18 11:17 UTC (permalink / raw To: Jarkko Sakkinen, Jonathan Calmels, Casey Schaufler Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Sat May 18, 2024 at 2:08 PM EEST, Jarkko Sakkinen wrote: > On Fri May 17, 2024 at 10:11 PM EEST, Jonathan Calmels wrote: > > On Fri, May 17, 2024 at 10:53:24AM GMT, Casey Schaufler wrote: > > > Of course they do. I have been following the use of capabilities > > > in Linux since before they were implemented. The uptake has been > > > disappointing in all use cases. > > > > Why "Of course"? > > What if they should not get *all* privileges? > > They do the job given a real-world workload and stress test. > > Here the problem is based on a theory and an experiment. > > Even a formal model does not necessarily map all "unknown unknowns". So this was like the worst "sales pitch" ever: 1. The cover letter starts with the idea of having to argue about name spaces, and have fun while doing that ;-) We all have our own ways to entertain ourselves but "name space duels" are not my thing. Why not just start with why we all want this instead? Maybe we don't want it then. Maybe this is just useless spam given the angle presented? 2. There's shitloads of computer science and set theory but nothing that would make common sense. You need to build more understandable model. There's zero "gist" in this work. Maybe this does make sense but the story around it sucks so far. BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-18 11:17 ` Jarkko Sakkinen @ 2024-05-18 11:21 ` Jarkko Sakkinen 2024-05-21 13:57 ` John Johansen 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-18 11:21 UTC (permalink / raw To: Jarkko Sakkinen, Jonathan Calmels, Casey Schaufler Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Sat May 18, 2024 at 2:17 PM EEST, Jarkko Sakkinen wrote: > On Sat May 18, 2024 at 2:08 PM EEST, Jarkko Sakkinen wrote: > > On Fri May 17, 2024 at 10:11 PM EEST, Jonathan Calmels wrote: > > > On Fri, May 17, 2024 at 10:53:24AM GMT, Casey Schaufler wrote: > > > > Of course they do. I have been following the use of capabilities > > > > in Linux since before they were implemented. The uptake has been > > > > disappointing in all use cases. > > > > > > Why "Of course"? > > > What if they should not get *all* privileges? > > > > They do the job given a real-world workload and stress test. > > > > Here the problem is based on a theory and an experiment. > > > > Even a formal model does not necessarily map all "unknown unknowns". > > So this was like the worst "sales pitch" ever: > > 1. The cover letter starts with the idea of having to argue about name > spaces, and have fun while doing that ;-) We all have our own ways to > entertain ourselves but "name space duels" are not my thing. Why not > just start with why we all want this instead? Maybe we don't want it > then. Maybe this is just useless spam given the angle presented? > 2. There's shitloads of computer science and set theory but nothing > that would make common sense. You need to build more understandable > model. There's zero "gist" in this work. > > Maybe this does make sense but the story around it sucks so far. One tip: I think this is wrong forum to present namespace ideas in the first place. 
It would be probably better to talk about this with e.g. systemd or podman developers, and similar groups. There's zero evidence of the usefulness. Then when you go that route and come back with actual users, things click much more easily. Now this is all in the void. BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-18 11:21 ` Jarkko Sakkinen @ 2024-05-21 13:57 ` John Johansen 2024-05-21 14:12 ` Jarkko Sakkinen 0 siblings, 1 reply; 53+ messages in thread From: John Johansen @ 2024-05-21 13:57 UTC (permalink / raw To: Jarkko Sakkinen, Jonathan Calmels, Casey Schaufler Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On 5/18/24 04:21, Jarkko Sakkinen wrote: > On Sat May 18, 2024 at 2:17 PM EEST, Jarkko Sakkinen wrote: >> On Sat May 18, 2024 at 2:08 PM EEST, Jarkko Sakkinen wrote: >>> On Fri May 17, 2024 at 10:11 PM EEST, Jonathan Calmels wrote: >>>> On Fri, May 17, 2024 at 10:53:24AM GMT, Casey Schaufler wrote: >>>>> Of course they do. I have been following the use of capabilities >>>>> in Linux since before they were implemented. The uptake has been >>>>> disappointing in all use cases. >>>> >>>> Why "Of course"? >>>> What if they should not get *all* privileges? >>> >>> They do the job given a real-world workload and stress test. >>> >>> Here the problem is based on a theory and an experiment. >>> >>> Even a formal model does not necessarily map all "unknown unknowns". >> >> So this was like the worst "sales pitch" ever: >> >> 1. The cover letter starts with the idea of having to argue about name >> spaces, and have fun while doing that ;-) We all have our own ways to >> entertain ourselves but "name space duels" are not my thing. Why not >> just start with why we all want this instead? Maybe we don't want it >> then. Maybe this is just useless spam given the angle presented? >> 2. There's shitloads of computer science and set theory but nothing >> that would make common sense. You need to build more understandable >> model. There's zero "gist" in this work. >> >> Maybe this does make sense but the story around it sucks so far. 
> > One tip: I think this is wrong forum to present namespace ideas in the > first place. It would be probably better to talk about this with e.g. > systemd or podman developers, and similar groups. There's zero evidence > of the usefulness. Then when you go that route and come back with actual > users, things click much more easily. Now this is all in the void. > > BR, Jarkko Jarkko, this is very much the right forum. User namespaces exist today. This is a discussion around trying to reduce the exposed kernel surface that is being used to attack the kernel. ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-21 13:57 ` John Johansen @ 2024-05-21 14:12 ` Jarkko Sakkinen 2024-05-21 14:45 ` John Johansen 0 siblings, 1 reply; 53+ messages in thread From: Jarkko Sakkinen @ 2024-05-21 14:12 UTC (permalink / raw To: John Johansen, Jonathan Calmels, Casey Schaufler Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Tue May 21, 2024 at 4:57 PM EEST, John Johansen wrote: > > One tip: I think this is wrong forum to present namespace ideas in the > > first place. It would be probably better to talk about this with e.g. > > systemd or podman developers, and similar groups. There's zero evidence > > of the usefulness. Then when you go that route and come back with actual > > users, things click much more easily. Now this is all in the void. > > > > BR, Jarkko > > Jarkko, > > this is very much the right forum. User namespaces exist today. This > is a discussion around trying to reduce the exposed kernel surface > that is being used to attack the kernel. Agreed, that was harsh way to put it. What I mean is that if this feature was included, would it be enabled by distributions? This user base part or potential user space part is not very well described in the cover letter. I.e. "motivation" to put it short. I mean the technical details are really in detail in this patch set but it would help to digest them if there was some even rough description how this would be deployed. If the motivation should be obvious, then it is beyond me, and thus would be nice if that obvious thing was stated that everyone else gets. E.g. I like to sometimes just test quite alien patch sets for the sake of learning and fun (or not so fun, depends) but this patch set does not deliver enough information to do anything at all. Hope this clears a bit where I stand. 
IMHO a good patch set should bring the details to the specialists on the topic but also have some wider audience motivational stuff in order to make clear where it fits in this world :-) BR, Jarkko ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-21 14:12 ` Jarkko Sakkinen @ 2024-05-21 14:45 ` John Johansen 2024-05-22 0:45 ` Jonathan Calmels 0 siblings, 1 reply; 53+ messages in thread From: John Johansen @ 2024-05-21 14:45 UTC (permalink / raw To: Jarkko Sakkinen, Jonathan Calmels, Casey Schaufler Cc: brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On 5/21/24 07:12, Jarkko Sakkinen wrote: > On Tue May 21, 2024 at 4:57 PM EEST, John Johansen wrote: >>> One tip: I think this is wrong forum to present namespace ideas in the >>> first place. It would be probably better to talk about this with e.g. >>> systemd or podman developers, and similar groups. There's zero evidence >>> of the usefulness. Then when you go that route and come back with actual >>> users, things click much more easily. Now this is all in the void. >>> >>> BR, Jarkko >> >> Jarkko, >> >> this is very much the right forum. User namespaces exist today. This >> is a discussion around trying to reduce the exposed kernel surface >> that is being used to attack the kernel. > > Agreed, that was harsh way to put it. What I mean is that if this > feature was included, would it be enabled by distributions? > Enabled, maybe? It requires the debian distros to make sure their packaging supports xattrs correctly. It should be good but it isn't well exercised. It also requires the work to set these on multiple applications. From experience we are talking 100s. It will break out of repo applications, and require an extra step for users to enable. Ubuntu is already breaking these but for many, of the more popular ones they are shipping profiles so the users don't have to take an extra step. Things like appimages remain broken and wil require an approach similar to the Mac with unverified software downloaded from the internet. 
Nor does this fix the bwrap, unshare, ... use case. Which means the distro is going to have to continue shipping an alternate solution that covers those. For Ubuntu atm this is just an extra point of friction but I expect we would still end up enabling it to tick the checkbox at some point if it goes into the upstream kernel.

> This user base part or potential user space part is not very well
> described in the cover letter. I.e. "motivation" to put it short.
>
yes the cover letter needs work

> I mean the technical details are really in detail in this patch set but
> it would help to digest them if there was some even rough description
> how this would be deployed.
>
yes

> If the motivation should be obvious, then it is beyond me, and thus
> would be nice if that obvious thing was stated that everyone else gets.
>
sure. The cover letter will get updated with this. Since I have been dealing with this a lot lately: it comes down to user namespaces allowing unprivileged code to access kernel surface area that is usually protected behind capabilities. This has been leveraged as part of the exploit chain in the majority of kernel exploits we are seeing.

> E.g. I like to sometimes just test quite alien patch sets for the sake
> of learning and fun (or not so fun, depends) but this patch set does not
> deliver enough information to do anything at all.
>
Understood. I am playing devil's advocate here. It's not that I don't see value in the proposal, but that I am not sure I see enough value with the current situation, where so much code has been written around the assumption that unprivileged user namespaces are safe. Trying to fix the situation without breaking everything is complicated.

> Hope this clears a bit where I stand.
IMHO a good patch set should bring > the details to the specialists on the topic but also have some wider > audience motivational stuff in order to make clear where it fits in this > world :-) > > BR, Jarkko > ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-21 14:45 ` John Johansen @ 2024-05-22 0:45 ` Jonathan Calmels 2024-05-31 7:43 ` John Johansen 0 siblings, 1 reply; 53+ messages in thread From: Jonathan Calmels @ 2024-05-22 0:45 UTC (permalink / raw To: John Johansen Cc: Jarkko Sakkinen, Casey Schaufler, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On Tue, May 21, 2024 at 07:45:20AM GMT, John Johansen wrote: > On 5/21/24 07:12, Jarkko Sakkinen wrote: > > On Tue May 21, 2024 at 4:57 PM EEST, John Johansen wrote: > > > > One tip: I think this is wrong forum to present namespace ideas in the > > > > first place. It would be probably better to talk about this with e.g. > > > > systemd or podman developers, and similar groups. There's zero evidence > > > > of the usefulness. Then when you go that route and come back with actual > > > > users, things click much more easily. Now this is all in the void. > > > > > > > > BR, Jarkko > > > > > > Jarkko, > > > > > > this is very much the right forum. User namespaces exist today. This > > > is a discussion around trying to reduce the exposed kernel surface > > > that is being used to attack the kernel. > > > > Agreed, that was harsh way to put it. What I mean is that if this > > feature was included, would it be enabled by distributions? > > > Enabled, maybe? It requires the debian distros to make sure their > packaging supports xattrs correctly. It should be good but it isn't > well exercised. It also requires the work to set these on multiple > applications. From experience we are talking 100s. > > It will break out of repo applications, and require an extra step for > users to enable. Ubuntu is already breaking these but for many, of the > more popular ones they are shipping profiles so the users don't have > to take an extra step. 
Things like appimages remain broken and wil > require an approach similar to the Mac with unverified software > downloaded from the internet. > > Nor does this fix the bwrap, unshare, ... use case. Which means the > distro is going to have to continue shipping an alternate solution > that covers those. For Ubuntu atm this is just an extra point of > friction but I expect we would still end up enabling it to tick the > checkbox at some point if it goes into the upstream kernel. I'm not sure I understand your point here and how this relates to xattrs. This patchset has nothing to do with file capabilities. The userns capability set is purely a process based capability set and in no way influenced by file attributes. > > This user base part or potential user space part is not very well > > described in the cover letter. I.e. "motivation" to put it short. > > > yes the cover letter needs work Yes, it's been mentioned several times already. While not in the cover letter, the motivation is stated in the first patch and provides several references to past discussions on the topic. This is nothing new, this subject has been contentious for years now and discussed over and over on these lists (Eric would know :)). As mentioned in the patch also, this recently warranted the inclusion of new LSM hooks. But again, I wrongfully assumed that this problem was well understood and still relatively fresh, that's my bad. > > I mean the technical details are really in detail in this patch set but > > it would help to digest them if there was some even rough description > > how this would be deployed. > > > yes Yes, this was purposefully left out so as not to influence any specific implementation. There is a mention of where this could be done (i.e. init, pam), but at the end of the day, this is going to depend on each use case. Having said that, since it appears to be confusing, maybe we could add some of the examples I sent out in this thread or the other ones. 
I want to reiterate that this is a generic capability set; this is not a magic switch you turn on to secure the whole system. Its implementation is going to vary across environments and it is going to be dictated by your threat model.

For example, John's threat model of securing a multi-user Ubuntu Desktop is going to be very different from, say, securing a server where all the userspace is fixed and known. The former might require additional integration with the LSM subsystem. Thankfully, this patch should synergize well with it.

Fundamentally, and at its core, it's very simple. Serge put it nicely:

> If you want root in a child user namespace to not have CAP_MAC_ADMIN,
> you drop it from your pU. Simple as that.

From there, you can imagine any integration you want in userspace and ways to enforce your own policies.

TLDR, this is a first step towards empowering userspace with control over capabilities granted by a userns. At present, the kernel does not offer ways to do this. By itself, it is not a comprehensive solution designed to thwart threat actors. However, it gives userspace the option to do so.

^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities
  2024-05-22 0:45 ` Jonathan Calmels
@ 2024-05-31 7:43 ` John Johansen
  0 siblings, 0 replies; 53+ messages in thread
From: John Johansen @ 2024-05-31 7:43 UTC (permalink / raw)
  To: Jonathan Calmels
  Cc: Jarkko Sakkinen, Casey Schaufler, brauner, ebiederm,
	Luis Chamberlain, Kees Cook, Joel Granados, Serge Hallyn,
	Paul Moore, James Morris, David Howells, containers,
	linux-kernel, linux-fsdevel, linux-security-module, keyrings

On 5/21/24 17:45, Jonathan Calmels wrote:
> On Tue, May 21, 2024 at 07:45:20AM GMT, John Johansen wrote:
>> On 5/21/24 07:12, Jarkko Sakkinen wrote:
>>> On Tue May 21, 2024 at 4:57 PM EEST, John Johansen wrote:
>>>>> One tip: I think this is wrong forum to present namespace ideas in the
>>>>> first place. It would be probably better to talk about this with e.g.
>>>>> systemd or podman developers, and similar groups. There's zero evidence
>>>>> of the usefulness. Then when you go that route and come back with actual
>>>>> users, things click much more easily. Now this is all in the void.
>>>>>
>>>>> BR, Jarkko
>>>>
>>>> Jarkko,
>>>>
>>>> this is very much the right forum. User namespaces exist today. This
>>>> is a discussion around trying to reduce the exposed kernel surface
>>>> that is being used to attack the kernel.
>>>
>>> Agreed, that was harsh way to put it. What I mean is that if this
>>> feature was included, would it be enabled by distributions?
>>>
>> Enabled, maybe? It requires the debian distros to make sure their
>> packaging supports xattrs correctly. It should be good but it isn't
>> well exercised. It also requires the work to set these on multiple
>> applications. From experience we are talking 100s.
>>
>> It will break out of repo applications, and require an extra step for
>> users to enable. Ubuntu is already breaking these but for many, of the
>> more popular ones they are shipping profiles so the users don't have
>> to take an extra step. Things like appimages remain broken and will
>> require an approach similar to the Mac with unverified software
>> downloaded from the internet.
>>
>> Nor does this fix the bwrap, unshare, ... use case. Which means the
>> distro is going to have to continue shipping an alternate solution
>> that covers those. For Ubuntu atm this is just an extra point of
>> friction but I expect we would still end up enabling it to tick the
>> checkbox at some point if it goes into the upstream kernel.
>
> I'm not sure I understand your point here and how this relates to xattrs.
> This patchset has nothing to do with file capabilities. The userns
> capability set is purely a process based capability set and in no way
> influenced by file attributes.
>

Oops, sorry, the fcaps bit is crossing over from a side discussion.

>>> This user base part or potential user space part is not very well
>>> described in the cover letter. I.e. "motivation" to put it short.
>>>
>> yes the cover letter needs work
>
> Yes, it's been mentioned several times already.
> While not in the cover letter, the motivation is stated in the first
> patch and provides several references to past discussions on the topic.
>
> This is nothing new, this subject has been contentious for years now and
> discussed over and over on these lists (Eric would know :)). As
> mentioned in the patch also, this recently warranted the inclusion of
> new LSM hooks.
>
> But again, I wrongfully assumed that this problem was well understood
> and still relatively fresh, that's my bad.
>
>>> I mean the technical details are really in detail in this patch set but
>>> it would help to digest them if there was some even rough description
>>> how this would be deployed.
>>>
>> yes
>
> Yes, this was purposefully left out so as not to influence any specific
> implementation. There is a mention of where this could be done (i.e.
> init, pam), but at the end of the day, this is going to depend on each
> use case.
>
> Having said that, since it appears to be confusing, maybe we could add
> some of the examples I sent out in this thread or the other ones.
>

examples would help, especially for people not too familiar with this.

> I want to reiterate that this is a generic capability set, this is not a
> magic switch you turn on to secure the whole system.
> Its implementation is going to vary across environments and it is going
> to be dictated by your threat model.
>

yeah

> For example, John's threat model of securing a multi-user Ubuntu Desktop
> is going to be very different than say securing a server where all the
> userspace is fixed and known.
> The former might require additional integration with the LSM subsystem.
> Thankfully, this patch should synergize well with it.
>

hrmmm, maybe, I will be happy if they just don't end up complicating
each other

> Fundamentally, and at its core, it's very simple. Serge put it nicely:
>

yes it is, and yet it still worries me a great deal. I have some of the
same worries as Casey, and also worry that people will take this as a
solution for all use cases, without understanding the issues.

On the other hand walking back the current state of unprivileged use of
user namespaces is a huge issue. Having another approach also pushing
will actually be helpful in some ways.

>> If you want root in a child user namespace to not have CAP_MAC_ADMIN,
>> you drop it from your pU. Simple as that.
>
> From there, you can imagine any integration you want in userspace and
> ways to enforce your own policies.
>
> TLDR, this is a first step towards empowering userspace with control
> over capabilities granted by a userns. At present, the kernel does not
> offer ways to do this. By itself, it is not a comprehensive solution

yep

> designed to thwart threat actors. However, it gives userspace the option
> to do so.

again, I don't believe the capabilities system is actually capable of
doing this, it covers some of the use cases. To be fair the LSM doesn't
cover everything either, there are current use cases that just aren't
safe, you either break them or allow them and accept the risks. It
relies on people understanding threat models, and sadly I have grown
quite grumpy about that topic.

Anyways I will try to finish up my review of the code this weekend.

* Re: [PATCH 0/3] Introduce user namespace capabilities
  2024-05-17 17:53 ` Casey Schaufler
  2024-05-17 19:11 ` Jonathan Calmels
@ 2024-05-18 12:20 ` Serge Hallyn
  2024-05-19 17:03 ` Casey Schaufler
  2024-05-21 14:29 ` John Johansen
  1 sibling, 2 replies; 53+ messages in thread
From: Serge Hallyn @ 2024-05-18 12:20 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Jonathan Calmels, Jarkko Sakkinen, brauner, ebiederm,
	Luis Chamberlain, Kees Cook, Joel Granados, Paul Moore,
	James Morris, David Howells, containers, linux-kernel,
	linux-fsdevel, linux-security-module, keyrings, serge

On Fri, May 17, 2024 at 10:53:24AM -0700, Casey Schaufler wrote:
> On 5/17/2024 4:42 AM, Jonathan Calmels wrote:
> >>>> On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote:
> >>>>> I suggest that adding a capability set for user namespaces is a bad idea:
> >>>>> - It is in no way obvious what problem it solves
> >>>>> - It is not obvious how it solves any problem
> >>>>> - The capability mechanism has not been popular, and relying on a
> >>>>>   community (e.g. container developers) to embrace it based on this
> >>>>>   enhancement is a recipe for failure
> >>>>> - Capabilities are already more complicated than modern developers
> >>>>>   want to deal with. Adding another, special purpose set, is going
> >>>>>   to make them even more difficult to use.
> > Sorry if the commit wasn't clear enough.
>
> While, as others have pointed out, the commit description left
> much to be desired, that isn't the biggest problem with the change
> you're proposing.
>
> > Basically:
> >
> > - Today user namespaces grant full capabilities.
>
> Of course they do. I have been following the use of capabilities
> in Linux since before they were implemented. The uptake has been
> disappointing in all use cases.
>
> > This behavior is often abused to attack various kernel subsystems.
>
> Yes. The problems of a single, all powerful root privilege scheme are
> well documented.
>
> > Only option
>
> Hardly.
>
> > is to disable them altogether which breaks a lot of
> > userspace stuff.
>
> Updating userspace components to behave properly in a capabilities
> environment has never been a popular activity, but is the right way
> to address this issue. And before you start on the "no one can do that,
> it's too hard", I'll point out that multiple UNIX systems supported
> rootless, all capabilities based systems back in the day.
>
> > This goes against the least privilege principle.
>
> If you're going to run userspace that *requires* privilege, you have
> to have a way to *allow* privilege. If the userspace insists on a root
> based privilege model, you're stuck supporting it. Regardless of your
> principles.

Casey,

I might be wrong, but I think you're misreading this patchset. It is not
about limiting capabilities in the init user ns at all. It's about limiting
the capabilities which a process in a child userns can get.

Any unprivileged task can create a new userns, and get a process with
all capabilities in that namespace. Always. User namespaces were a
great success in that we can do this without any resulting privilege
against host owned resources. The unaddressed issue is the expanded
kernel code surface area.

You say, above, (quoting out of place here)

> Updating userspace components to behave properly in a capabilities
> environment has never been a popular activity, but is the right way
> to address this issue. And before you start on the "no one can do that,
> it's too hard", I'll point out that multiple UNIX systems supported

He's not saying no one can do that. He's saying, correctly, that the
kernel currently offers no way for userspace to do this limiting. His
patchset offers two ways: one system wide capability mask (which applies
only to non-initial user namespaces) and one per-process inherited one
which - yay - userspace can use to limit what its children will be
able to get if they unshare a user namespace.

> > - It adds a new capability set.
>
> Which is a really, really bad idea. The equation for calculating effective
> privilege is already more complicated than userspace developers are generally
> willing to put up with.

This is somewhat true, but I think the semantics of what is proposed here are
about as straightforward as you could hope for, and you can basically reason
about them completely independently of the other sets. Only when reasoning
about the correctness of this code do you need to consider the other sets. Not
when administering a system.

If you want root in a child user namespace to not have CAP_MAC_ADMIN, you drop
it from your pU. Simple as that.

> > This set dictates what capabilities are granted in namespaces (instead
> > of always getting full caps).
>
> I would not expect container developers to be eager to learn how to use
> this facility.

I'm a container developer, and I'm excited about it :)

> > This brings namespaces in line with the rest of the system, user
> > namespaces are no more "special".
>
> I'm sorry, but this makes no sense to me whatsoever. You want to introduce
> a capability set explicitly for namespaces in order to make them less
> special?

Yes, exactly.

> Maybe I'm just old and cranky.

That's fine.

> > They now work the same way as say a transition to root does with
> > inheritable caps.
>
> That needs some explanation.
>
> >
> > - This isn't intended to be used by end users per se (although they could).
> > This would be used at the same places where existing capabilities are
> > used today (e.g. init system, pam, container runtime, browser
> > sandbox), or by system administrators.
>
> I understand that. It is for containers. Containers are not kernel entities.

User namespaces are.

This patch set provides userspace a way of limiting the kernel code exposed
to untrusted children, which currently does not exist.

> > To give you some ideas of things you could do:
> >
> > # E.g. prevent alice from getting CAP_NET_ADMIN in user namespaces under SSH
> > echo "auth optional pam_cap.so" >> /etc/pam.d/sshd
> > echo "!cap_net_admin alice" >> /etc/security/capability.conf
> >
> > # E.g. prevent any Docker container from ever getting CAP_DAC_OVERRIDE
> > systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \
> >             -p SecureBits=userns-strict-caps \
> >             /usr/bin/dockerd
> >
> > # E.g. kernel could be vulnerable to CAP_SYS_RAWIO exploits
> > # Prevent users from ever gaining it
> > sysctl -w cap_bound_userns_mask=0x1fffffdffff

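The mask in the last quoted example (0x1fffffdffff) can be checked by hand. The sketch below assumes the sysctl value is a plain bitmask indexed by capability number, that the highest capability is number 40 (CAP_CHECKPOINT_RESTORE, the value of CAP_LAST_CAP in kernels contemporary with this thread), and that CAP_SYS_RAWIO is bit 17 as in include/uapi/linux/capability.h. The helper names mask_without() and has_cap() are hypothetical and exist only for illustration; how the proposed sysctl actually parses its value is defined by patch 3, not by this sketch.

```python
# Capability numbers as defined in include/uapi/linux/capability.h (subset).
CAP_DAC_OVERRIDE = 1
CAP_NET_ADMIN = 12
CAP_SYS_RAWIO = 17
CAP_MAC_ADMIN = 33
CAP_LAST_CAP = 40  # assumption: CAP_CHECKPOINT_RESTORE is the highest capability

# Every capability bit from 0 through CAP_LAST_CAP set.
FULL_MASK = (1 << (CAP_LAST_CAP + 1)) - 1


def mask_without(*caps):
    """Return the full mask with the given capability bits cleared (hypothetical helper)."""
    mask = FULL_MASK
    for cap in caps:
        mask &= ~(1 << cap)
    return mask


def has_cap(mask, cap):
    """Return True if the capability bit is set in the mask (hypothetical helper)."""
    return bool(mask & (1 << cap))


if __name__ == "__main__":
    # Clearing only CAP_SYS_RAWIO reproduces the value used in the sysctl example.
    print(hex(mask_without(CAP_SYS_RAWIO)))
```

Under those assumptions, clearing bit 17 from the 41-bit full mask yields exactly the value passed to sysctl in the example, and masks that drop other capabilities can be derived the same way.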
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-18 12:20 ` Serge Hallyn @ 2024-05-19 17:03 ` Casey Schaufler 2024-05-20 0:54 ` Jonathan Calmels 2024-05-21 14:29 ` John Johansen 1 sibling, 1 reply; 53+ messages in thread From: Casey Schaufler @ 2024-05-19 17:03 UTC (permalink / raw To: Serge Hallyn Cc: Jonathan Calmels, Jarkko Sakkinen, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings, Casey Schaufler On 5/18/2024 5:20 AM, Serge Hallyn wrote: > On Fri, May 17, 2024 at 10:53:24AM -0700, Casey Schaufler wrote: >> On 5/17/2024 4:42 AM, Jonathan Calmels wrote: >>>>>> On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote: >>>>>>> I suggest that adding a capability set for user namespaces is a bad idea: >>>>>>> - It is in no way obvious what problem it solves >>>>>>> - It is not obvious how it solves any problem >>>>>>> - The capability mechanism has not been popular, and relying on a >>>>>>> community (e.g. container developers) to embrace it based on this >>>>>>> enhancement is a recipe for failure >>>>>>> - Capabilities are already more complicated than modern developers >>>>>>> want to deal with. Adding another, special purpose set, is going >>>>>>> to make them even more difficult to use. >>> Sorry if the commit wasn't clear enough. >> While, as others have pointed out, the commit description left >> much to be desired, that isn't the biggest problem with the change >> you're proposing. >> >>> Basically: >>> >>> - Today user namespaces grant full capabilities. >> Of course they do. I have been following the use of capabilities >> in Linux since before they were implemented. The uptake has been >> disappointing in all use cases. >> >>> This behavior is often abused to attack various kernel subsystems. >> Yes. The problems of a single, all powerful root privilege scheme are >> well documented. 
>> >>> Only option >> Hardly. >> >>> is to disable them altogether which breaks a lot of >>> userspace stuff. >> Updating userspace components to behave properly in a capabilities >> environment has never been a popular activity, but is the right way >> to address this issue. And before you start on the "no one can do that, >> it's too hard", I'll point out that multiple UNIX systems supported >> rootless, all capabilities based systems back in the day. >> >>> This goes against the least privilege principle. >> If you're going to run userspace that *requires* privilege, you have >> to have a way to *allow* privilege. If the userspace insists on a root >> based privilege model, you're stuck supporting it. Regardless of your >> principles. > Casey, > > I might be wrong, but I think you're misreading this patchset. It is not > about limiting capabilities in the init user ns at all. It's about limiting > the capabilities which a process in a child userns can get. I do understand that. My objection is not to the intent, but to the approach. Adding a capability set to the general mechanism in support of a limited, specific use case seems wrong to me. I would rather see a mechanism in userns to limit the capabilities in a user namespace than a mechanism in capabilities that is specific to user namespaces. > Any unprivileged task can create a new userns, and get a process with > all capabilities in that namespace. Always. User namespaces were a > great success in that we can do this without any resulting privilege > against host owned resources. The unaddressed issue is the expanded > kernel code surface area. An option to clone() then, to limit the capabilities available? I honestly can't recall if that has been suggested elsewhere, and apologize if it's already been dismissed as a stoopid idea. 
> > You say, above, (quoting out of place here) > >> Updating userspace components to behave properly in a capabilities >> environment has never been a popular activity, but is the right way >> to address this issue. And before you start on the "no one can do that, >> it's too hard", I'll point out that multiple UNIX systems supported > He's not saying no one can do that. He's saying, correctly, that the > kernel currently offers no way for userspace to do this limiting. His > patchset offers two ways: one system wide capability mask (which applies > only to non-initial user namespaces) and on per-process inherited one > which - yay - userspace can use to limit what its children will be > able to get if they unshare a user namespace. > >>> - It adds a new capability set. >> Which is a really, really bad idea. The equation for calculating effective >> privilege is already more complicated than userspace developers are generally >> willing to put up with. > This is somewhat true, but I think the semantics of what is proposed here are > about as straightforward as you could hope for, and you can basically reason > about them completely independently of the other sets. Only when reasoning > about the correctness of this code do you need to consider the other sets. Not > when administering a system. > > If you want root in a child user namespace to not have CAP_MAC_ADMIN, you drop > it from your pU. Simple as that. > >>> This set dictates what capabilities are granted in namespaces (instead >>> of always getting full caps). >> I would not expect container developers to be eager to learn how to use >> this facility. > I'm a container developer, and I'm excited about it :) OK, well, I'm wrong. It's happened before and will happen again. > >>> This brings namespaces in line with the rest of the system, user >>> namespaces are no more "special". >> I'm sorry, but this makes no sense to me whatsoever. 
You want to introduce >> a capability set explicitly for namespaces in order to make them less >> special? > Yes, exactly. Hmm. I can't say I buy that. It makes a whole lot more sense to me to change userns than to change capabilities. > >> Maybe I'm just old and cranky. > That's fine. > >>> They now work the same way as say a transition to root does with >>> inheritable caps. >> That needs some explanation. >> >>> - This isn't intended to be used by end users per se (although they could). >>> This would be used at the same places where existing capabalities are >>> used today (e.g. init system, pam, container runtime, browser >>> sandbox), or by system administrators. >> I understand that. It is for containers. Containers are not kernel entities. > User namespaces are. > > This patch set provides userspace a way of limiting the kernel code exposed > to untrusted children, which currently does not exist. Yes, I understand. I would rather see a change to userns in support of a userns specific need than a change to capabilities for a userns specific need. >>> To give you some ideas of things you could do: >>> >>> # E.g. prevent alice from getting CAP_NET_ADMIN in user namespaces under SSH >>> echo "auth optional pam_cap.so" >> /etc/pam.d/sshd >>> echo "!cap_net_admin alice" >> /etc/security/capability.conf. >>> >>> # E.g. prevent any Docker container from ever getting CAP_DAC_OVERRIDE >>> systemd-run -p CapabilityBoundingSet=~CAP_DAC_OVERRIDE \ >>> -p SecureBits=userns-strict-caps \ >>> /usr/bin/dockerd >>> >>> # E.g. kernel could be vulnerable to CAP_SYS_RAWIO exploits >>> # Prevent users from ever gaining it >>> sysctl -w cap_bound_userns_mask=0x1fffffdffff ^ permalink raw reply [flat|nested] 53+ messages in thread
* Re: [PATCH 0/3] Introduce user namespace capabilities
  2024-05-19 17:03 ` Casey Schaufler
@ 2024-05-20 0:54 ` Jonathan Calmels
  0 siblings, 0 replies; 53+ messages in thread
From: Jonathan Calmels @ 2024-05-20 0:54 UTC (permalink / raw)
  To: Casey Schaufler
  Cc: Serge Hallyn, Jarkko Sakkinen, brauner, ebiederm,
	Luis Chamberlain, Kees Cook, Joel Granados, Paul Moore,
	James Morris, David Howells, containers, linux-kernel,
	linux-fsdevel, linux-security-module, keyrings

On Sun, May 19, 2024 at 10:03:29AM GMT, Casey Schaufler wrote:
> I do understand that. My objection is not to the intent, but to the approach.
> Adding a capability set to the general mechanism in support of a limited, specific
> use case seems wrong to me. I would rather see a mechanism in userns to limit
> the capabilities in a user namespace than a mechanism in capabilities that is
> specific to user namespaces.

> An option to clone() then, to limit the capabilities available?
> I honestly can't recall if that has been suggested elsewhere, and
> apologize if it's already been dismissed as a stoopid idea.

No and you're right, this would also make sense. This was considered as
well as things like ioctl_ns() (basically introducing the concept of
capabilities in the user_namespace struct). I also considered reusing
the existing sets with various schemes to no avail.

The main issue with this approach is that you have to consider how this
is going to be used. This ties into the other thread we've had with John
and Eric.
Basically, we're coming from a model where things are wide open and
we're trying to tighten things down.

Quoting John here:

> We are starting from a different posture here. Where applications have
> assumed that user namespaces were safe and no measures were needed.
> Tools like unshare and bwrap if set to allow user namespaces in their
> fcaps will allow exploits a trivial by-pass.

We can't really expect userspace to patch every single userns callsite
and opt in to this new security mechanism.
You said it well yourself:

> Capabilities are already more complicated than modern developers
> want to deal with.

Moreover, policies are not necessarily enforced at said callsites. Take
for example a service like systemd-machined, or a PAM session. Those
need to be able to place restrictions on any processes spawned under
them.

If we do this in clone() (or similar), we'll also need to come up with
inheritance rules, being able to query capabilities, etc.
At this point we're just reinventing capability sets.

Finally, the nice thing about having it as a capability set is that we
can easily define rules between them. Patch 2 is a good example of this.
It constrains the userns set to the bounding set of a task, thus
requiring minimal/no change to userspace and helping with adoption.

> Yes, I understand. I would rather see a change to userns in support of a userns
> specific need than a change to capabilities for a userns specific need.

Valid point, but at the end of the day, those are really just tasks'
capabilities. The unshare() just happens to trigger specific rules when
it comes to the tasks' creds. This isn't so different than the other
sets and their specific rules for execve() or UID 0.

This could also be reframed as:

Why would setting capabilities on tasks in a userns be so different than
tasks outside of it?

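To make the rules discussed in this message easier to follow, here is a toy model built only from the descriptions given in this thread, not from the patch code. It assumes that the capabilities a task obtains when it creates a user namespace are its userns set (pU) limited by the system-wide sysctl mask, and that with the strict securebit from patch 2 the result is further limited by the task's bounding set. The function name, the parameter names and the abbreviated capability list are all hypothetical; the authoritative rules are the ones implemented in the patches themselves.

```python
# Abbreviated capability universe for illustration only (assumption: the real
# set contains every capability defined by the kernel).
ALL_CAPS = frozenset({
    "CAP_DAC_OVERRIDE",
    "CAP_NET_ADMIN",
    "CAP_SYS_RAWIO",
    "CAP_SYS_ADMIN",
    "CAP_MAC_ADMIN",
})


def caps_granted_in_child_userns(p_userns, bounding, system_mask, strict=False):
    """Toy model: capabilities a task would hold in a newly created userns.

    p_userns    -- the task's userns capability set (pU in the thread)
    bounding    -- the task's bounding set
    system_mask -- the system-wide mask (the sysctl from patch 3)
    strict      -- models the strict securebit from patch 2 (assumed semantics)
    """
    granted = frozenset(p_userns) & frozenset(system_mask)
    if strict:
        # Assumed rule: the userns set is constrained by the bounding set.
        granted &= frozenset(bounding)
    return granted


if __name__ == "__main__":
    # Behaviour described as "today": a full pU grants everything.
    print(sorted(caps_granted_in_child_userns(ALL_CAPS, ALL_CAPS, ALL_CAPS)))

    # "If you want root in a child user namespace to not have CAP_MAC_ADMIN,
    # you drop it from your pU."
    p_u = ALL_CAPS - {"CAP_MAC_ADMIN"}
    print(sorted(caps_granted_in_child_userns(p_u, ALL_CAPS, ALL_CAPS)))
```

In this model a service manager or PAM session only has to shrink pU (or the bounding set, in strict mode) once; everything spawned underneath inherits the restriction without any change at the callsites that create user namespaces, which is the adoption argument made above.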
* Re: [PATCH 0/3] Introduce user namespace capabilities 2024-05-18 12:20 ` Serge Hallyn 2024-05-19 17:03 ` Casey Schaufler @ 2024-05-21 14:29 ` John Johansen 1 sibling, 0 replies; 53+ messages in thread From: John Johansen @ 2024-05-21 14:29 UTC (permalink / raw To: Serge Hallyn, Casey Schaufler Cc: Jonathan Calmels, Jarkko Sakkinen, brauner, ebiederm, Luis Chamberlain, Kees Cook, Joel Granados, Paul Moore, James Morris, David Howells, containers, linux-kernel, linux-fsdevel, linux-security-module, keyrings On 5/18/24 05:20, Serge Hallyn wrote: > On Fri, May 17, 2024 at 10:53:24AM -0700, Casey Schaufler wrote: >> On 5/17/2024 4:42 AM, Jonathan Calmels wrote: >>>>>> On Thu May 16, 2024 at 10:07 PM EEST, Casey Schaufler wrote: >>>>>>> I suggest that adding a capability set for user namespaces is a bad idea: >>>>>>> - It is in no way obvious what problem it solves >>>>>>> - It is not obvious how it solves any problem >>>>>>> - The capability mechanism has not been popular, and relying on a >>>>>>> community (e.g. container developers) to embrace it based on this >>>>>>> enhancement is a recipe for failure >>>>>>> - Capabilities are already more complicated than modern developers >>>>>>> want to deal with. Adding another, special purpose set, is going >>>>>>> to make them even more difficult to use. >>> Sorry if the commit wasn't clear enough. >> >> While, as others have pointed out, the commit description left >> much to be desired, that isn't the biggest problem with the change >> you're proposing. >> >>> Basically: >>> >>> - Today user namespaces grant full capabilities. >> >> Of course they do. I have been following the use of capabilities >> in Linux since before they were implemented. The uptake has been >> disappointing in all use cases. >> >>> This behavior is often abused to attack various kernel subsystems. >> >> Yes. The problems of a single, all powerful root privilege scheme are >> well documented. >> >>> Only option >> >> Hardly. 
>> >>> is to disable them altogether which breaks a lot of >>> userspace stuff. >> >> Updating userspace components to behave properly in a capabilities >> environment has never been a popular activity, but is the right way >> to address this issue. And before you start on the "no one can do that, >> it's too hard", I'll point out that multiple UNIX systems supported >> rootless, all capabilities based systems back in the day. >> >>> This goes against the least privilege principle. >> >> If you're going to run userspace that *requires* privilege, you have >> to have a way to *allow* privilege. If the userspace insists on a root >> based privilege model, you're stuck supporting it. Regardless of your >> principles. > > Casey, > > I might be wrong, but I think you're misreading this patchset. It is not > about limiting capabilities in the init user ns at all. It's about limiting > the capabilities which a process in a child userns can get. > > Any unprivileged task can create a new userns, and get a process with > all capabilities in that namespace. Always. User namespaces were a > great success in that we can do this without any resulting privilege > against host owned resources. The unaddressed issue is the expanded > kernel code surface area. > > You say, above, (quoting out of place here) > >> Updating userspace components to behave properly in a capabilities >> environment has never been a popular activity, but is the right way >> to address this issue. And before you start on the "no one can do that, >> it's too hard", I'll point out that multiple UNIX systems supported > > He's not saying no one can do that. He's saying, correctly, that the > kernel currently offers no way for userspace to do this limiting. 
His > patchset offers two ways: one system wide capability mask (which applies > only to non-initial user namespaces) and on per-process inherited one > which - yay - userspace can use to limit what its children will be > able to get if they unshare a user namespace. > >>> - It adds a new capability set. >> >> Which is a really, really bad idea. The equation for calculating effective >> privilege is already more complicated than userspace developers are generally >> willing to put up with. > > This is somewhat true, but I think the semantics of what is proposed here are > about as straightforward as you could hope for, and you can basically reason > about them completely independently of the other sets. Only when reasoning > about the correctness of this code do you need to consider the other sets. Not > when administering a system. > > If you want root in a child user namespace to not have CAP_MAC_ADMIN, you drop > it from your pU. Simple as that. > >>> This set dictates what capabilities are granted in namespaces (instead >>> of always getting full caps). >> >> I would not expect container developers to be eager to learn how to use >> this facility. > > I'm a container developer, and I'm excited about it :) > >>> This brings namespaces in line with the rest of the system, user >>> namespaces are no more "special". >> >> I'm sorry, but this makes no sense to me whatsoever. You want to introduce >> a capability set explicitly for namespaces in order to make them less >> special? > > Yes, exactly. > >> Maybe I'm just old and cranky. > > That's fine. > >>> They now work the same way as say a transition to root does with >>> inheritable caps. >> >> That needs some explanation. >> >>> >>> - This isn't intended to be used by end users per se (although they could). >>> This would be used at the same places where existing capabalities are >>> used today (e.g. init system, pam, container runtime, browser >>> sandbox), or by system administrators. >> >> I understand that. 
It is for containers. Containers are not kernel entities. > > User namespaces are. > > This patch set provides userspace a way of limiting the kernel code exposed > to untrusted children, which currently does not exist. > theoretically, I am worried that in practice the existing utils allow untrusted code to still access user namespaces. In practice we have found that we need to allow a different set of capabilities when bwrap is called from flatpak than when called on its own etc. We see the same pattern with unshare and other utilities around launching applications in user namespaces. In practice at the distro level I don't see this approach actually helping. Because we have so many uses that require exposing close to the full capabilities set in multiple utilities that are required by many different applications. To be clear this doesn't stop distros from doing something more, but is it worth the added complexity if in practice it can't be used effectively. I really don't have the answer. ^ permalink raw reply [flat|nested] 53+ messages in thread