mirror of
https://https.git.savannah.gnu.org/git/guix.git/
synced 2025-07-12 01:50:46 +02:00
The container that slirp4netns runs in should already be quite difficult
to do anything malicious in beyond basic denial of service or sending of
network traffic.  There is, however, one hole remaining in the case in
which there is an adversary able to run code locally: abstract unix
sockets.  Because these are governed by network namespaces, not IPC
namespaces, and slirp4netns is in the root network namespace, any process
in the root network namespace can cooperate with the slirp4netns process
to take over its user.

To close this, we use seccomp to block the creation of unix-domain sockets
by slirp4netns.  This requires some finesse, since slirp4netns absolutely
needs to be able to create other types of sockets - at minimum AF_INET and
AF_INET6.

Seccomp has many, many pitfalls.  To name a few:

1. Seccomp provides you with an "arch" field, but this does not uniquely
   determine the ABI being used; the actual meaning of a system call
   number depends on both the number (which is often the result of ORing
   a related system call with a flag for an alternate ABI) and the
   architecture.

2. Seccomp provides no direct way of knowing what the native value for
   the arch field should be; the user must do configure/compile-time
   testing for every architecture+ABI combination they want to support.
   Amusingly enough, the linux-internal header files have this exact
   information (SECCOMP_ARCH_NATIVE), but they aren't sharing it.

3. The only system call numbers we naturally have are the native ones in
   asm/unistd.h.  __NR_socket will always refer to the system call number
   for the target system's ABI.

4. Seccomp can only manipulate 32-bit words, but represents every system
   call argument as a uint64.

5. New system call numbers with as-yet-unknown semantics can be added to
   the kernel at any time.

6. Based on this comment in arch/x86/entry/syscalls/syscall_32.tbl:

     # 251 is available for reuse (was briefly sys_set_zone_reclaim)

   previously-invalid system call numbers may later be reused for new
   system calls.

7. Most architecture+ABI combinations have system call tables with many
   gaps in them.  arm-eabi, for example, has 35 such gaps (note: this is
   just the number of distinct gaps, not the number of system call
   numbers contained in those gaps).

8. Seccomp's BPF filters require a fully-acyclic control flow graph.  Any
   operation on a data structure must therefore first be fully unrolled
   before it can be run.

9. Seccomp cannot dereference pointers.  Only the raw bits provided to
   the system calls can be inspected.

10. Some architecture+ABI combos have multiplexer system calls.  For
    example, socketcall can perform any socket-related system call.  The
    arguments to the multiplexed system call are passed indirectly, via a
    pointer to user memory.  They therefore cannot be inspected by
    seccomp.

11. Some valid system calls are not listed in any table in the kernel
    source.  For example, __ARM_NR_cacheflush is an "ARM private" system
    call.  It does not appear in any *.tbl file.

12. Conditional branches are limited to relative jumps of at most 256
    instructions forward.

13. Prior to Linux 4.8, any process able to spawn another process and
    call ptrace could bypass seccomp restrictions.

To address (1), (2), and (3), we include preprocessor checks to identify
the native architecture value, and reject all system calls that don't use
the native architecture.

To address (4), we use the AC_C_BIGENDIAN autoconf check to conditionally
define WORDS_BIGENDIAN, and match up the proper portions of any uint64 we
test for with the value in the accumulator being tested against.

To address (5) and (6), we use system call pinning.  That is, we hardcode
a snapshot of all the valid system call numbers at the time of writing,
and reject any system call numbers not in the recorded set.  A set is
recorded for every architecture+ABI combo, and the native one is chosen at
compile-time.  This ensures that not only are non-native architectures
rejected, but so are non-native ABIs.  For the sake of conciseness, we
represent these sets as sets of disjoint ranges.  Due to (7), checking
each range in turn could add a lot of overhead to each system call, so we
instead binary search through the ranges.  Due to (8), this binary search
has to be fully unrolled, so we do that too.

It can be tedious and error-prone to manually produce the syscall ranges
by looking at linux's *.tbl files, since the gaps are often small and
uncommented.  To address this, a script,
build-aux/extract-syscall-ranges.sh, is added that will produce them given
a *.tbl filename and an ABI regex (some tables seem to abuse the ABI field
with strange values like "memfd_secret").  Note that producing the final
values still requires looking at the proper asm/unistd.h file to find any
private numbers and to identify any offsets and ABI variants used.

(10) used to have no good solution, but in the past decade most
architectures have gained dedicated system call alternatives to at least
socketcall, so we can (hopefully) just block it entirely.

To address (13), we block ptrace also.

* build-aux/extract-syscall-ranges.sh: new script.
* Makefile.am (EXTRA_DIST): register it.
* config-daemon.ac: use AC_C_BIGENDIAN.
* nix/libutil/spawn.cc (setNoNewPrivsAction, addSeccompFilterAction): new
  functions.
* nix/libutil/spawn.hh (setNoNewPrivsAction, addSeccompFilterAction): new
  declarations.
  (SpawnContext)[setNoNewPrivs, addSeccompFilter]: new fields.
* nix/libutil/seccomp.hh: new header file.
* nix/libutil/seccomp.cc: new file.
* nix/local.mk (libutil_a_SOURCES, libutil_headers): register them.
* nix/libstore/build.cc (slirpSeccompFilter, writeSeccompFilterDot): new
  functions.
  (spawnSlirp4netns): use them, set seccomp filter for slirp4netns.

Change-Id: Ic92c7f564ab12596b87ed0801b22f88fbb543b95
Signed-off-by: John Kehayias <john.kehayias@protonmail.com>
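The pinning and word-matching ideas in the message above can be sketched in ordinary C++.  This is only an illustration under stated assumptions, not the code in nix/libutil/seccomp.cc: the names SyscallRange, inPinnedSet, lowWordOffset and highWordOffset are hypothetical, and a real filter would emit each step of the search as separate BPF compare-and-jump instructions rather than run a loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: one inclusive run of valid system call numbers.
struct SyscallRange {
    uint32_t first;
    uint32_t last;
};

// Binary search over sorted, disjoint, inclusive ranges.  A seccomp filter
// cannot loop, so in a filter every iteration below would have to be
// written out as its own instructions (the "fully unrolled" form).
bool inPinnedSet(const std::vector<SyscallRange> & ranges, uint32_t nr)
{
    size_t lo = 0, hi = ranges.size();
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (nr < ranges[mid].first)
            hi = mid;               // nr lies below this range
        else if (nr > ranges[mid].last)
            lo = mid + 1;           // nr lies above this range
        else
            return true;            // nr falls inside this range
    }
    return false;                   // nr falls in a gap: reject
}

// Byte offset, inside one 64-bit argument slot, of the 32-bit word that
// holds the low (respectively high) half of the value.  The bigEndian
// flag stands in for a compile-time check such as WORDS_BIGENDIAN.
constexpr size_t lowWordOffset(bool bigEndian) { return bigEndian ? 4 : 0; }
constexpr size_t highWordOffset(bool bigEndian) { return bigEndian ? 0 : 4; }
```

With ranges {0..10}, {12..20} and {100..100}, the number 11 falls in a gap and is rejected, while 0, 10 and 100 are accepted.  The search takes a logarithmic number of comparisons in the number of ranges, which is the reason given above for preferring it to checking each range in turn.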
174 lines
6.3 KiB
C++
#pragma once

#include <util.hh>
#include <map>
#include <stddef.h>
#ifdef __linux__
#include <linux/filter.h>
#endif

namespace nix {
struct SpawnContext; /* Forward declaration */
typedef void (Action)(SpawnContext & ctx);

struct Phase {
    string label;
    Action * action;
};

typedef std::vector<Phase> Phases;

/* Common structure read from / written to by setup phases in a newly-spawned
   child process. Configure this to determine which per-process or
   per-thread attributes should be set. */
struct SpawnContext {
    ssize_t currentPhase = 0;
    Phases phases;
    Strings args; /* Will be passed as-is to execve, does not implicitly add
                   * program basename as argv[0]! */
    Path program;
    bool inheritEnv = true; /* True to use the current environment after env
                             * has been applied to it, false to use strictly
                             * env. */
    std::map<string, string> env;
    bool setPersona = false;
    int persona;
    int logFD = -1; /* -1 to keep stdout and stderr */
    set<int> earlyCloseFDs; /* Typically for closing inherited unused pipe or
                             * socket ends to prevent hangs when reading or
                             * writing. */
    bool closeMostFDs = false;
    set<int> preserveFDs; /* 0, 1, and 2 are always implicitly preserved. */
    bool setStdin = false;
    int stdinFD = -1; /* fd or -1 */
    Path stdinFile; /* used if stdinFD == -1 */
    bool setuid = false;
    uid_t user;
    bool setgid = false;
    gid_t group;
    bool setSupplementaryGroups = false;
    std::vector<gid_t> supplementaryGroups;
    bool setsid = false;
    bool oomSacrifice = false; /* Whether to attempt to offer the child
                                * process to the OOM killer if possible. */
    bool setcwd = false;
    Path cwd;
    bool signalSetupSuccess = false; /* Whether the parent is waiting for a
                                      * message that setup succeeded. By
                                      * default success is signaled by
                                      * writing a single newline to stderr. */
    bool dropAmbientCapabilities = false; /* Whether to drop ambient
                                           * capabilities if on a system that
                                           * supports them. */
    bool setNoNewPrivs = false;
    bool addSeccompFilter = false;
#if __linux__
    std::vector<struct sock_filter> seccompFilter;
#endif
    bool doChroot = false;
    Path chrootRootDir;
    void * extraData; /* Extra user data */
};

/* Like SpawnContext, but with extra fields for setting up Linux namespaces,
   as created by clone or unshare. */
struct CloneSpawnContext : SpawnContext {
    int cloneFlags = 0;
    std::map<Path, Path> filesInChroot; /* map from path inside chroot to
                                         * path outside of chroot */
    set<Path> readOnlyFilesInChroot;
    bool mountTmpfsOnChroot = false; /* req. CLONE_NEWNS and doChroot */
    bool mountProc = false;
    bool mountDevshm = false;
    bool maybeMountDevpts = false; /* Only mounted if /dev/ptmx doesn't exist
                                    * after any chroot, if applicable. */
    bool lockMounts = false; /* Whether to lock mounts by creating a fresh
                              * user and mount namespace, see
                              * mount_namespaces(7). */
    bool lockMountsMapAll = false; /* Whether to map all currently-mapped
                                      users and groups when locking mounts or
                                      only the current ones. */
    bool lockMountsAllowSetgroups = false;
    int setupFD = -1; /* Used for userns init sync and other stuff */
    string hostname; /* Requires CLONE_NEWUTS */
    string domainname; /* Same */
    bool initLoopback = false; /* Also requires CLONE_NEWNET in cloneFlags */
    /* These may be used if CLONE_NEWUSER in cloneFlags. These are to be
       used when an id other than the current uid/gid has been mapped into the
       child's user namespace, and it now needs to setuid/setgid to an id
       that is mapped. */
    bool usernsSetuid = false;
    uid_t usernsUser;
    bool usernsSetgid = false;
    gid_t usernsGroup;
};

void addPhaseAfter(Phases & phases, string afterLabel, string addLabel, Action addAction);

void addPhaseBefore(Phases & phases, string beforeLabel, string addLabel, Action addAction);

void prependPhase(Phases & phases, string addLabel, Action addAction);

void appendPhase(Phases & phases, string addLabel, Action addAction);

void deletePhase(Phases & phases, string delLabel);

void replacePhase(Phases & phases, string replaceLabel, Action newAction);

Action reset_writeToStderrAction;
Action restoreAffinityAction;
Action setsidAction;
Action earlyIOSetupAction;
Action dropAmbientCapabilitiesAction;
Action chrootAction;
Action chdirAction;
Action closeMostFDsAction;
Action setPersonalityAction;
Action oomSacrificeAction;
Action setIDsAction;
Action setNoNewPrivsAction;
Action addSeccompFilterAction;
Action restoreSIGPIPEAction;
Action setupSuccessAction;
Action execAction;

Phases getBasicSpawnPhases();

void bindMount(Path source, Path target, bool readOnly);

void mountIntoChroot(std::map<Path, Path> filesInChroot,
                     set<Path> readOnlyFiles,
                     Path chrootRootDir);

Action usernsInitSyncAction;
Action usernsSetIDsAction;
Action initLoopbackAction;
Action setHostAndDomainAction;
Action makeFilesystemsPrivateAction;
Action makeChrootSeparateFilesystemAction;
Action mountIntoChrootAction;
Action mountProcAction;
Action mountDevshmAction;
Action mountDevptsAction;
Action pivotRootAction;
Action lockMountsAction;

Phases getCloneSpawnPhases();

/* Helpers */
string idMapToIdentityMap(const string & map);
void unshareAndInitUserns(int flags, const string & uidMap,
                          const string & gidMap, bool allowSetgroups);

/* Run the phases of ctx in order, catching and reporting any exception, and
 * exiting in all cases. */
void runChildSetup(SpawnContext & ctx);

/* Helper to call runChildSetup that can be passed to the variant of clone
 * that expects a callback. */
int runChildSetupEntry(void *data);

/* Create a new process using clone that will immediately call runChildSetup
 * with the provided CloneSpawnContext. Return the pid of the new process. */
int cloneChild(CloneSpawnContext & ctx);
}
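The header only declares the phase-list helpers, so the following stand-alone sketch shows one plausible way a labelled phase list of this shape could behave.  Everything prefixed with Demo or demo is hypothetical and mirrors the declarations above only loosely; the behaviour when a label is missing (throwing an exception) is an assumption, not something the header states.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct DemoContext; /* Forward declaration, as in the header above */
typedef void (DemoAction)(DemoContext & ctx);

struct DemoPhase {
    std::string label;
    DemoAction * action;
};

typedef std::vector<DemoPhase> DemoPhases;

struct DemoContext {
    size_t currentPhase = 0;
    DemoPhases phases;
    std::vector<std::string> trace; /* labels of the phases that ran */
};

/* Insert a new phase right after the first phase labelled afterLabel.
   Assumption: an unknown label is reported by throwing. */
void demoAddPhaseAfter(DemoPhases & phases, const std::string & afterLabel,
                       const std::string & addLabel, DemoAction addAction)
{
    for (auto it = phases.begin(); it != phases.end(); ++it)
        if (it->label == afterLabel) {
            phases.insert(it + 1, DemoPhase{addLabel, addAction});
            return;
        }
    throw std::invalid_argument("no phase labelled " + afterLabel);
}

/* Run every phase in order, keeping currentPhase up to date so that an
   action can tell which phase it is running as. */
void demoRun(DemoContext & ctx)
{
    for (ctx.currentPhase = 0; ctx.currentPhase < ctx.phases.size();
         ++ctx.currentPhase)
        ctx.phases[ctx.currentPhase].action(ctx);
}

/* Example action: record the label of the phase currently running. */
void traceAction(DemoContext & ctx)
{
    ctx.trace.push_back(ctx.phases[ctx.currentPhase].label);
}
```

Starting from the two phases "setIDs" and "exec", a call such as demoAddPhaseAfter(phases, "setIDs", "addSeccompFilter", traceAction) places the new phase between them, so that demoRun visits the three labels in that order.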