Expert-Level Linux Syscalls & Kernel Interfaces

A comprehensive reference for senior performance engineers: syscalls, kernel interfaces, and mechanisms that go beyond the basics. Each entry covers what it does, why it matters, who uses it, and the sharp edges.


CPU Performance & Execution

  • sched_yield() vs spin-wait vs futex(FUTEX_WAIT) — The tradeoff triangle. sched_yield() is almost always wrong — it yields to any runnable thread, not the one holding your lock. Spin-wait with pause instruction (x86 _mm_pause()) is correct for <1μs waits on dedicated cores. futex is correct for everything else. Senior engineers know sched_yield in a spinlock is an anti-pattern that causes priority inversion and cache thrashing.
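  • getcpu() via vDSO — Usually not a real syscall: the kernel maps the vDSO into every process, and its getcpu() implementation reads the current CPU and NUMA node without entering the kernel. Costs a few nanoseconds vs ~100ns for a real syscall. Used to index per-CPU data structures. On x86, rdtscp returns the contents of IA32_TSC_AUX in ECX, which Linux initializes with the CPU and node number, if you want even lower overhead.
  • arch_prctl(ARCH_SET_CPUID) (x86) — Enable or disable the CPUID instruction for the calling thread. When it is disabled, executing CPUID raises SIGSEGV, so a supervisor can trap the instruction and supply its own results. Used by emulators and record/replay debuggers to intercept and rewrite CPUID results without full hardware virtualization. Requires a CPU that supports CPUID faulting; ARCH_GET_CPUID reports the current setting.
  • prctl(PR_SET_SPECULATION_CTRL) — Per-task control over speculation mitigations (PR_SPEC_STORE_BYPASS, PR_SPEC_INDIRECT_BRANCH). The names refer to the speculation feature, not the mitigation: PR_SPEC_DISABLE turns the speculative behavior off, which means the mitigation (SSBD, STIBP/IBPB) is on (safe but ~5-15% overhead). PR_SPEC_FORCE_DISABLE does the same and locks the setting, so a later PR_SPEC_ENABLE fails. HFT shops sometimes use PR_SPEC_ENABLE to reclaim that 5-15% on isolated cores, accepting the Spectre risk on a machine processing no untrusted code. This only takes effect when the system-wide mitigation mode leaves the choice to prctl.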
  • prctl(PR_TASK_PERF_EVENTS_DISABLE / ENABLE) — Pause and resume all perf_event_open counters for the calling task. Used for precise measurement windows — disable counters, run setup code, enable counters, run the hot path, disable counters. Eliminates noise from initialization.
  • UMCG (User-Managed Concurrency Groups) — (proposed patch series, not merged into mainline) Cooperative scheduling between userspace and kernel. A userspace scheduler (Go runtime, Java virtual threads) could context-switch between tasks without entering the kernel, but still get notified when a task blocks on I/O. Google's internal fiber library drove this. The goal is M:N threading with proper kernel integration rather than the workarounds Go currently uses.
  • sched_setattr() with SCHED_FLAG_UTIL_CLAMP_MIN / SCHED_FLAG_UTIL_CLAMP_MAX — Clamp the utilization (0-1024) that the scheduler assumes for a thread, which drives frequency selection under schedutil and task placement on asymmetric CPUs. sched_util_min says "treat this thread as at least this busy" (avoids clock ramp-up latency on the fast path). sched_util_max says "never treat it as busier than this" (power cap). Android uses this extensively: for example, a UI thread might get util_min=512 while background sync gets util_max=256.
  • membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ) (5.10+) — Interrupt any rseq critical section currently running on another thread of the process so that it restarts. If you update a per-CPU data structure from a remote thread, you need this to ensure no reader is mid-critical-section with stale data. The process must first register with MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ. The cost is an IPI to each core running the process: expensive, but cheaper than taking a lock on the read side.
  • perf_event_open() with PERF_COUNT_HW_REF_CPU_CYCLES vs PERF_COUNT_HW_CPU_CYCLES — Reference cycles count at a fixed rate regardless of frequency scaling. CPU cycles count at the actual core frequency. The ratio tells you your effective clock speed relative to the reference (base) frequency. If cycles/ref_cycles is 0.6, your code is running at 60% of base frequency, probably throttled by power or thermal limits; a ratio above 1.0 means the core is running in turbo. This is how you diagnose "the code is correct but slow in production" when the answer is CPU throttling.
  • /sys/devices/cpu/rdpmc + mmap of perf counter — After perf_event_open, you can mmap the counter and read it from userspace via RDPMC instruction. Costs ~20 cycles vs ~1000 cycles for a read() syscall. Nanobenchmark frameworks use this for per-function IPC, cache miss, and branch misprediction counting.
  • sched_setattr() with SCHED_DEADLINE — Earliest Deadline First scheduling. You declare runtime, deadline, and period — the kernel guarantees CPU time. Used in real-time audio, robotics, and low-latency trading.
  • perf_event_open() — Direct access to hardware performance counters (cache misses, branch mispredictions, TLB misses, IPC). The underlying syscall behind perf. You can set up ring buffers for sampling with zero syscall overhead in the steady state.
  • clone3() — The modern version of clone() with extensible struct-based args. Supports CLONE_INTO_CGROUP (spawn directly into a cgroup), CLONE_PIDFD (get a pidfd atomically at creation).
  • sched_setaffinity() + getcpu() — Pin threads to cores. getcpu() tells you which core you're on without a syscall (uses vDSO). Critical for NUMA-aware data structures and per-core sharding.
  • futex() (specifically FUTEX_WAIT_BITSET, FUTEX_LOCK_PI) — Priority-inheritance futexes prevent priority inversion. Bitset waiters allow selective wakeup of specific threads. The foundation of every serious userspace synchronization primitive.
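  • rseq() (restartable sequences) — Register a per-CPU critical section that the kernel restarts if the thread is preempted, migrated, or interrupted by a signal. Enables per-CPU data structures without locks or atomics. glibc registers rseq automatically since 2.35 (and uses it to speed up sched_getcpu()); tcmalloc's per-CPU caches are the best-known allocator built on it.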
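
The pinning described in the sched_setaffinity() bullet can be tried from Python, whose os module is a thin wrapper over the same syscalls. This is a minimal sketch, assuming Linux; pin_to_cpu is a hypothetical helper name, and picking the lowest allowed CPU is an arbitrary policy chosen only for illustration.

```python
import os


def pin_to_cpu(cpu=None):
    """Pin the calling thread to a single CPU.

    Returns (cpu, previous_mask) so the caller can restore the old mask.
    pin_to_cpu is a hypothetical helper, not part of any library.
    """
    previous = os.sched_getaffinity(0)  # pid 0 means "the calling thread"
    if cpu is None:
        cpu = min(previous)  # arbitrary choice: lowest CPU we may run on
    if cpu not in previous:
        raise ValueError(f"CPU {cpu} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {cpu})  # wraps sched_setaffinity(2)
    return cpu, previous
```

After the call, per-core shards can be indexed by the returned CPU number. The previous mask is returned so the thread can be unpinned again with os.sched_setaffinity(0, previous).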

Memory — Allocators, Huge Pages & NUMA

  • madvise(MADV_HUGEPAGE) on specific VMAs vs system-wide THP — The expert move is to switch THP to opt-in mode (write madvise to /sys/kernel/mm/transparent_hugepage/enabled) and then MADV_HUGEPAGE only on your known-hot regions (buffer pool, JIT code cache, hash tables). This gets huge page benefits without compaction and khugepaged stalls on the rest of your heap. ScyllaDB, ClickHouse, and many other databases do this.
  • MAP_HUGETLB | MAP_HUGE_1GB — Explicit 1GB huge pages. Cuts TLB pressure dramatically for large memory regions: a 256GB buffer pool is covered by 256 translations instead of 67 million 4KB ones. Reserve the pages at boot (for example hugepagesz=1G hugepages=64 for 64GB), because the kernel can rarely find 1GB of contiguous free memory once the system has been running. Database vendors (Oracle, SAP HANA) require this in production tuning guides.
  • mbind(MPOL_INTERLEAVE) — Interleave allocations round-robin across NUMA nodes. Counter-intuitive: for shared data structures accessed by all cores, interleaving is faster than local allocation because it distributes memory bandwidth across all memory controllers. This is not the kernel default: page cache and anonymous pages are normally allocated on the node local to the allocating CPU, so interleaving has to be requested explicitly (mbind, set_mempolicy, or numactl --interleave).
  • move_pages() with status query (NULL new_nodes) — Query the NUMA location of specific pages without moving them. Build a heatmap of where your data actually lives. If your "NUMA-local" allocation ended up on the wrong node (because of memory pressure and rebalancing), this tells you. Then use migrate_pages() or move_pages() to fix it.
  • set_mempolicy(MPOL_PREFERRED_MANY) (5.15+) — Prefer allocation from a set of NUMA nodes instead of just one. Falls back to other nodes only when all preferred nodes are exhausted. Designed for CXL memory tiers where you have DRAM + CXL-attached DRAM and want a preference order without hard binding.
  • MADV_DONTFORK / MADV_WIPEONFORK — MADV_DONTFORK leaves the region out of the child's address space after fork. MADV_WIPEONFORK (4.14+) keeps the mapping, but the child sees zero-filled memory. Critical for security-sensitive memory (encryption keys, ASLR secrets). Without them, a forked child inherits your crypto key material via CoW pages. OpenSSL should use this on its entropy pool but famously didn't for years.
  • mprotect() with PROT_MTE (ARM64 MTE) — Enable Memory Tagging Extension on a memory region. Every 16-byte granule gets a 4-bit color tag. Pointers carry a tag in their top bits. A mismatched tag raises a fault, reported synchronously or asynchronously depending on the mode the process selected. Catches use-after-free and buffer overflows in hardware at low cost. Android 14+ enables this for system daemons. The PROT_MTE flag on mmap/mprotect controls which regions are tagged.
  • process_mrelease() (5.15+) — Release the memory of an already-dying process faster. When you kill -9 a huge-memory process, the kernel can take seconds to free hundreds of GB (holding mmap_lock, stalling everything). This syscall tells the kernel "prioritize reaping this process's pages." OOM killers and container runtimes use this so a dead 512GB JVM doesn't stall the entire host for 10 seconds.
  • KSM (Kernel Same-page Merging) via madvise(MADV_MERGEABLE) — The kernel periodically scans marked regions and CoW-merges identical pages. A hypervisor running 100 VMs with the same OS image saves ~30-50% memory. The tradeoff: the scanning thread (ksmd) costs CPU, and CoW faults when pages diverge can cause latency spikes. Expert tuning involves pages_to_scan, sleep_millisecs, and per-process MADV_MERGEABLE targeting.
  • cachestat() (6.5+) — Query how many pages of a file range are in the page cache, dirty, under writeback, or recently evicted, in a single syscall. Before this, you had to mmap() the file and call mincore() on the mapping, which returns only a resident/not-resident byte per page and no dirty info. Database engines use this to decide whether to use buffered I/O (data already cached) or direct I/O (data cold, skip the cache).
  • madvise(MADV_HUGEPAGE / MADV_FREE / MADV_DONTNEED / MADV_SEQUENTIAL) — Fine-grained hints to the kernel's memory manager. MADV_FREE (lazy free, anonymous memory only) is cheaper than MADV_DONTNEED because it defers actual page reclaim until there is memory pressure. MADV_POPULATE_READ/WRITE (5.14+) pre-faults pages without touching them from userspace, which eliminates page fault storms on large allocations.
  • mlock2(MLOCK_ONFAULT) — Lock pages into RAM only when faulted in, not eagerly. Critical for latency-sensitive apps that allocate large arenas but use them sparsely. (MCL_ONFAULT is the matching flag for mlockall().)
  • process_madvise() — Apply madvise to another process's address space. Used by memory management daemons and container runtimes.
  • userfaultfd() — Get notified in userspace when a page fault occurs and resolve it yourself. Enables post-copy live migration (QEMU), lazy restore of checkpointed processes (CRIU), and userspace-managed paging. Also used for custom allocators.
  • memfd_create() + memfd_secret() — memfd_create gives you anonymous file-backed memory (great for zero-copy IPC with sendmsg/SCM_RIGHTS). memfd_secret() (5.14+) creates memory that's removed from the kernel's direct map, so other processes and most kernel code paths cannot read it. Defense against Spectre/Meltdown-style side channels.
  • mbind() / set_mempolicy() / move_pages() — NUMA memory placement. mbind binds memory ranges to specific NUMA nodes. move_pages migrates individual pages between nodes. Essential for database buffer pools and HPC.

Virtual Memory & TLB Tricks

  • prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME) — Name anonymous VMAs so they show up labeled in /proc/pid/maps. Android uses this extensively for debugging memory — you can tell your allocator's arenas apart from mmap'd files. Trivial to use, almost nobody outside Android/game engines knows it exists.
  • mmap(MAP_FIXED_NOREPLACE) — Like MAP_FIXED but fails instead of silently unmapping existing mappings. Eliminates an entire class of bugs where you accidentally clobber your own address space. Database buffer pools should always use this.
  • mmap(MAP_POPULATE | MAP_LOCKED) — Pre-fault and wire all pages at mmap time. Eliminates both minor faults and swapout. The nuclear option for latency — used in HFT shared memory segments.
  • MADV_COLD / MADV_PAGEOUT (5.4+) — MADV_COLD demotes pages to the inactive list (they'll be reclaimed first under pressure). MADV_PAGEOUT immediately reclaims them to swap/zswap. Android's ActivityManager uses these to manage app memory tiers. Lets you implement your own multi-tier memory policy from userspace.
  • mremap(MREMAP_DONTUNMAP) (5.7+) — Remap pages to a new address but leave the old VMA in place as empty (zero-fill-on-demand). Enables atomic-ish relocation of mappings for concurrent data structures — readers on the old address get zeroes instead of SIGSEGV while you update pointers. Used in userspace garbage collectors.
  • PR_SET_THP_DISABLE — Disable transparent huge pages for a specific process. When your allocation pattern fragments 2MB pages, THP's compaction daemon (khugepaged) causes multi-millisecond tail latency spikes. Redis, ScyllaDB, and most databases disable THP for this reason — this prctl lets you do it per-process instead of system-wide.
  • mincore() — Query which pages of a mapping are resident in the page cache. Lets you build page-cache-aware prefetching — only issue readahead for ranges not already cached. Database query engines use this to avoid redundant I/O on hot data.
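
The effect of MAP_POPULATE from the bullet above can be observed by counting minor faults while touching a fresh mapping. This is a rough sketch, assuming Linux; minor_faults_when_touching is a hypothetical helper, and the fallback value 0x8000 for MAP_POPULATE is an assumption that holds on x86-64 and arm64 (Python exposes the constant from 3.10). Python's mmap module has no MAP_LOCKED constant, so only the pre-faulting half of the pattern is shown.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE
# mmap.MAP_POPULATE exists from Python 3.10; 0x8000 is the Linux value on
# x86-64 and arm64 and is an assumption on other architectures.
MAP_POPULATE = getattr(mmap, "MAP_POPULATE", 0x8000)


def minor_faults_when_touching(pages, populate):
    """Map `pages` anonymous pages, write to each once, and return the
    number of minor faults the process took during the writes."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    if populate:
        flags |= MAP_POPULATE  # ask the kernel to pre-fault at mmap time
    with mmap.mmap(-1, pages * PAGE, flags=flags) as region:
        before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
        for offset in range(0, pages * PAGE, PAGE):
            region[offset] = 1  # one write per page
        after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    return after - before
```

Comparing minor_faults_when_touching(4096, True) with minor_faults_when_touching(4096, False) typically shows the populated mapping taking almost no faults in the loop, while the unpopulated one faults on first touch (fewer faults than pages when transparent huge pages back the region). The counter is process-wide, so other threads add noise.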

Storage & Direct I/O

  • O_DIRECT + io_uring with registered buffers — The holy grail of storage I/O. O_DIRECT bypasses the page cache (no double-buffering, predictable memory usage). io_uring with IORING_REGISTER_BUFFERS + IORING_OP_READ_FIXED pre-registers DMA-able memory regions so the kernel skips get_user_pages() on every I/O. Combined with IORING_SETUP_SQPOLL, you get DMA from NVMe directly to your buffer pool with zero syscalls and zero page table walks in the hot path.
  • pwritev2() with RWF_DSYNC — Per-I/O O_DSYNC without opening the file with O_DSYNC. Lets you choose per write whether you need durability. Database WAL writes use RWF_DSYNC, but data page writes go buffered and sync_file_range() later. Avoids the all-or-nothing O_DSYNC flag on the fd.
  • pwritev2() with RWF_NOWAIT — Return EAGAIN if the write would block (e.g., needs journal space, memory allocation, or inode lookup). Combined with io_uring, this lets you keep the fast path non-blocking and fall back to the kernel worker thread pool only when necessary. Avoids head-of-line blocking in your submission queue.
  • pwritev2() with RWF_APPEND (4.16+) — Atomic append without O_APPEND on the fd. The data goes to the current end of file regardless of the offset argument, as one atomic step. Multiple threads can append to a log file without locks, while other writes on the same fd can still target explicit offsets (O_APPEND would force every write on the fd to append).
  • preadv2() with RWF_HIPRI — Request polling-based completion instead of interrupt-driven. The kernel busy-polls the NVMe completion queue for this specific I/O. Reduces latency by ~5-10μs (the interrupt latency). Only useful for ultra-low-latency NVMe devices; on SATA it does nothing.
  • fallocate(FALLOC_FL_ZERO_RANGE) vs FALLOC_FL_PUNCH_HOLE — Both make a range read back as zeros, but differently. ZERO_RANGE guarantees zeroed reads and normally keeps the blocks allocated, typically by converting them to unwritten extents (details are filesystem-dependent). PUNCH_HOLE (which must be combined with FALLOC_FL_KEEP_SIZE) deallocates blocks and the file becomes sparse. For a WAL that recycles segments, ZERO_RANGE is correct (zeroed data, pre-allocated extents preserved). PUNCH_HOLE is for freeing space (log compaction, sparse snapshots).
  • ioctl(BLKDISCARD) / ioctl(BLKSECDISCARD) — Tell the SSD to TRIM/discard a range of LBAs. BLKDISCARD makes no promise about contents (the device may or may not return zeros afterwards). BLKSECDISCARD requests a secure discard, where the device must actually erase the data, and fails on devices that do not support it. A database that manages a raw block device can issue BLKDISCARD on the range freed by a dropped table to give the SSD more spare area, which reduces write amplification and keeps performance consistent.
  • ioctl(FICLONE) / ioctl(FICLONERANGE) — Reflink copy: create a CoW clone of a file (or byte range) instantly. Both files share the same physical extents until one is modified. On btrfs/XFS (with reflink), cloning a 100GB database file takes microseconds. Used for instant snapshots, test database provisioning, and fork-based MVCC.
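  • io_uring IORING_OP_SPLICE + IORING_OP_TEE — Zero-copy data movement through io_uring. tee() only duplicates data between two pipes, so the chain is: splice from a socket into a pipe, tee that pipe into a second pipe, then splice the second pipe into a file while the first pipe continues to its real destination. Linked together, these operations log all network traffic to disk with zero copies and zero syscalls in the steady state. Proxy servers and packet capture tools use this.
  • io_uring IORING_REGISTER_FILES + IOSQE_FIXED_FILE — Pre-register file descriptors so io_uring indexes its own file table instead of doing fdget()/fdput() per operation. On a server with 100K open fds and many threads, the per-I/O lookup and reference counting on shared file structures is measurable. Registered files bypass the fdtable entirely. (IORING_SETUP_REGISTERED_FD_ONLY is a separate flag that concerns the ring's own fd.)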
  • NVMe passthrough via io_uring (IORING_OP_URING_CMD on /dev/ng0n1) — Send raw NVMe commands (admin or I/O) through io_uring. Bypass the entire block layer. You can issue vendor-specific commands (SSD internal telemetry, zone management, key-value commands on KV-SSDs), or do I/O with custom metadata (T10-PI, DIF). SPDK without leaving the kernel.
  • splice() / tee() / vmsplice() — Zero-copy data movement between file descriptors via kernel pipe buffers. vmsplice maps userspace pages into a pipe (true zero-copy). HAProxy uses splice() for TCP forwarding; Nginx and Kafka rely mainly on the closely related sendfile().
  • copy_file_range() — Server-side copy between two fds, potentially without data touching userspace or even the CPU (on NFS, reflink-capable filesystems like btrfs/XFS).
  • fallocate(FALLOC_FL_PUNCH_HOLE / FALLOC_FL_COLLAPSE_RANGE / FALLOC_FL_INSERT_RANGE) — Surgically manipulate file extents. Punch holes for sparse files, collapse ranges to remove data without rewriting, insert ranges to add space mid-file. Database engines use this for WAL recycling and SST compaction.
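  • sync_file_range() — Initiate writeback on a byte range without waiting or flushing metadata. Lets you pipeline writeback: start flushing page N while writing page N+1. It gives no durability guarantee on its own (no metadata update, no device cache flush), so a final fsync()/fdatasync() is still required. PostgreSQL uses it to push dirty data out ahead of checkpoint fsyncs.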
  • open_by_handle_at() / name_to_handle_at() — Resolve files by filesystem-level handle (inode + generation), bypassing path resolution entirely. NFS servers and backup tools use these. Also a privilege escalation vector if misused.
  • readahead() — Explicitly trigger kernel readahead into the page cache for a file range. Useful when your access pattern is known but not sequential enough for the kernel's heuristic.
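  • posix_fadvise(POSIX_FADV_DONTNEED) — Evict pages from the page cache after you're done. Prevents a big sequential scan from polluting the cache for other workloads. Dirty pages are not dropped, so write them back first (fdatasync or sync_file_range) if the range was just written.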

File System Internals

  • FS_IOC_FIEMAP (ioctl) — Query the exact physical block layout of a file on disk. Returns extent maps showing which logical byte ranges map to which physical disk offsets. Database storage engines use this to detect fragmentation and make defragmentation decisions. Also reveals if a file is sparse, preallocated, or reflinked.
  • FS_IOC_GETFLAGS / FS_IOC_SETFLAGS — Per-file flags like FS_IMMUTABLE_FL (even root can't modify), FS_APPEND_FL (append-only), FS_NOCOW_FL (disable copy-on-write on btrfs — critical for database files on btrfs, without it random writes trigger CoW storms). chattr is the CLI wrapper but the ioctl is what you use programmatically.
  • renameat2(RENAME_EXCHANGE) — Atomically swap two files. Both paths exist before and after. Used for atomic config updates — write new config to a temp file, then RENAME_EXCHANGE with the live config. If you crash mid-swap, you still have the old file at one path and the new file at the other. SQLite's WAL checkpointing could use this.
  • renameat2(RENAME_WHITEOUT) — Rename a file and leave a whiteout entry at the old path. Used by overlay filesystems (Docker's storage driver) — a whiteout means "this file was deleted in the upper layer" without physically removing the lower layer's copy.
  • statx() — The modern stat() with extensible struct. Returns stx_btime (birth/creation time, which stat() can't report), stx_dio_mem_align / stx_dio_offset_align (6.1+; the exact alignment requirements for O_DIRECT on this file, so no more guessing), and stx_mnt_id. The alignment fields alone are worth the switch: different filesystems and block devices have different direct I/O alignment requirements and this tells you exactly.
  • openat2() with RESOLVE_BENEATH | RESOLVE_NO_SYMLINKS | RESOLVE_NO_MAGICLINKS (5.6+) — Path resolution with security constraints. RESOLVE_BENEATH prevents escaping a directory (replaces O_NOFOLLOW + manual checks). RESOLVE_NO_MAGICLINKS blocks /proc/self/fd/N traversal. Container runtimes use this to prevent symlink attacks on bind mounts.
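
Python has no built-in wrapper for renameat2(), so the RENAME_EXCHANGE swap described above is sketched here through ctypes. Assumptions: glibc 2.28+ (which exports renameat2), a filesystem that supports RENAME_EXCHANGE, and constant values copied from the Linux headers. exchange and demo_swap are hypothetical helper names.

```python
import ctypes
import os
import tempfile

AT_FDCWD = -100            # "relative to the current directory" (linux/fcntl.h)
RENAME_EXCHANGE = 1 << 1   # from linux/fs.h

_libc = ctypes.CDLL(None, use_errno=True)
_libc.renameat2.argtypes = [ctypes.c_int, ctypes.c_char_p,
                            ctypes.c_int, ctypes.c_char_p, ctypes.c_uint]
_libc.renameat2.restype = ctypes.c_int


def exchange(path_a, path_b):
    """Atomically swap two existing paths with renameat2(RENAME_EXCHANGE)."""
    ret = _libc.renameat2(AT_FDCWD, os.fsencode(path_a),
                          AT_FDCWD, os.fsencode(path_b), RENAME_EXCHANGE)
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def demo_swap():
    """Swap a staged config with the live one; return (live, staged) contents."""
    with tempfile.TemporaryDirectory() as directory:
        live = os.path.join(directory, "config")
        staged = os.path.join(directory, "config.new")
        with open(live, "w") as f:
            f.write("old")
        with open(staged, "w") as f:
            f.write("new")
        exchange(staged, live)  # both names exist before and after
        with open(live) as f_live, open(staged) as f_staged:
            return f_live.read(), f_staged.read()
```

The sketch leaves out the fsync() calls a durable update needs (the staged file before the swap, the directory after it). It shows only that both names remain valid and simply trade contents.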

Profiling & Observability

  • perf_event_open() with PERF_SAMPLE_DATA_SRC — For each sampled memory access, tells you where the data came from: L1/L2/L3 hit, local DRAM, remote DRAM (cross-NUMA), I/O (MMIO). This is how you prove that your "NUMA-aware" data structure is actually hitting remote DRAM 40% of the time. perf mem record wraps this.
  • perf_event_open() with PERF_SAMPLE_WEIGHT — Each sample carries a weight (latency in cycles). Sort by weight to find the most expensive individual memory accesses, not just the most frequent. One L3 miss costs 200 cycles; finding the 10 instructions responsible for 90% of your stall cycles is the difference between a 2% and a 30% optimization.
  • perf_event_open() with PERF_SAMPLE_PHYS_ADDR — Sample the physical address of memory accesses. Map physical addresses to DIMM topology to identify a failing DIMM causing correctable ECC errors that slow down accesses. Also reveals DRAM bank conflicts and DRAM row buffer miss patterns.
  • Intel PEBS (Processor Event-Based Sampling) via perf_event_open() with precise_ip=3 — The CPU itself writes samples to a hardware buffer, then interrupts the kernel to drain it. Gives exact IP attribution (no skid). Normal sampling has 1-100 instruction skid — your profile says function A is hot but actually function B three instructions earlier caused the event. PEBS eliminates this.
  • AMD IBS (Instruction-Based Sampling) — AMD's answer to PEBS. Randomly selects instructions and records everything about their execution: fetch latency, cache hit level, TLB level, branch prediction correctness, completion latency. More comprehensive than PEBS — IBS Fetch and IBS Op give you the full microarchitectural story per instruction.
  • bpf() with BPF_PROG_TYPE_RAW_TRACEPOINT — Attach to raw tracepoints and receive the tracepoint's raw arguments instead of the pre-built event record that regular tracepoint programs get. Skipping that marshalling step can make them ~2x faster than regular tracepoints for high-frequency events (scheduler, memory allocator). With BTF-enabled raw tracepoints (tp_btf), the program can access kernel struct fields directly.
  • perf_event_open() with PERF_TYPE_BREAKPOINT — Hardware watchpoints via the debug registers. Monitor reads/writes to a specific memory address. Up to 4 simultaneous watchpoints on x86. Faster than page-fault-based watchpoints (mprotect tricks) and more precise. Used to answer "who is writing to this corrupted field?"
  • perf_event_open() with Intel PT — Full control-flow tracing via hardware. Intel PT is a dynamic PMU, so attr.type is read from /sys/bus/event_source/devices/intel_pt/type rather than taken from a fixed PERF_TYPE_* constant. The CPU writes compressed branch packets to a buffer at ~5% overhead, and a decoder reconstructs the exact execution path post-mortem. Used for coverage-guided fuzzing (kAFL, honggfuzz), execution-history debugging (gdb's record btrace), and production tracing of rare bugs.
  • USDT (User Statically-Defined Tracepoints) via bpf() — Embed a NOP instruction plus an ELF note at each key point in your binary. eBPF attaches to them at runtime through uprobes, with near-zero cost when disabled. PostgreSQL, MySQL, JVM, and Python have USDT probes. Example: bpftrace -p $PID -e 'usdt:postgresql:query__start { printf("%s\n", str(arg0)); }' gives production query tracing with no restart.
  • /proc/pid/smaps_rollup — Single-read summary of a process's memory: RSS, PSS (proportional share of shared pages), swap, anonymous, file-backed, huge pages. smaps gives per-VMA detail, but its output grows with the number of VMAs and takes >100ms for large processes. smaps_rollup still walks every VMA, but it accumulates the totals inside the kernel and emits a single record, which is far cheaper to generate and parse. Monitoring daemons should prefer smaps_rollup: reading full smaps for a JVM with 100K VMAs can stall the target process.
  • /proc/pressure/{cpu,memory,io} (PSI) — System-wide pressure stall information. cat /proc/pressure/memory gives you some avg10=0.50 avg60=0.25 avg300=0.10 total=12345 — the percentage of time some tasks are stalled on memory. You can poll() these files with trigger thresholds. Android and systemd-oomd use PSI triggers instead of free-memory thresholds for OOM decisions because PSI measures actual impact not just headroom.
  • pidfd_open() / pidfd_send_signal() / pidfd_getfd() — Race-free process management via file descriptors (no PID recycling issues). pidfd_getfd duplicates a file descriptor from another process into the caller (subject to ptrace access checks); used by debuggers and container runtimes.
  • kcmp() — Compare kernel objects between two processes (do these two fds point to the same underlying file/socket/pipe?). Used by checkpoint/restore (CRIU).
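
The PSI line format quoted above is easy to consume programmatically. The following is a small sketch of a parser; parse_psi and read_psi are hypothetical helper names, and the only format assumption is the one shown in the PSI bullet (a kind such as some or full, followed by key=value pairs, with total as an integer count).

```python
def parse_psi(text):
    """Parse the contents of a /proc/pressure/* file.

    Returns {"some": {"avg10": ..., "avg60": ..., "avg300": ..., "total": ...},
             "full": {...}} for whichever lines are present.
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        metrics = {}
        for pair in pairs:
            key, _, value = pair.partition("=")
            # "total" is a cumulative integer; the avg* fields are percentages.
            metrics[key] = int(value) if key == "total" else float(value)
        result[kind] = metrics
    return result


def read_psi(resource="memory"):
    """Read and parse /proc/pressure/<resource> (cpu, memory, or io)."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())
```

A monitoring loop could call read_psi("memory") periodically and alert on the some/avg10 value. The poll()-based triggers mentioned in the bullet avoid even that periodic read.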

Security

  • landlock_restrict_self() — Unprivileged filesystem sandboxing. A process voluntarily restricts itself to a set of directories/file types. Unlike seccomp (which filters syscalls), Landlock filters filesystem operations. A document renderer can say "I can only read from /tmp/render_input/ and write to /tmp/render_output/" — even if an exploit gives arbitrary code execution, it can't touch /etc/shadow.
  • seccomp(SECCOMP_SET_MODE_FILTER) with SECCOMP_FILTER_FLAG_TSYNC — Apply the seccomp filter to all threads of the process simultaneously. Without TSYNC, the filter covers only the calling thread (and anything it creates afterwards), so threads that already exist keep running unfiltered. Container runtimes and multithreaded sandboxes use TSYNC because a sandbox with a gap is worse than no sandbox (false sense of security).
  • prctl(PR_SET_NO_NEW_PRIVS) — The process (and all descendants) can never gain privileges via execve of setuid/setgid binaries or file capabilities. Required before seccomp(SECCOMP_SET_MODE_FILTER) for processes without CAP_SYS_ADMIN. Without that rule, an unprivileged process could install a filter that makes selected syscalls fail or return bogus results and then exec a setuid binary, tricking privileged code into misbehaving. This is why it's mandatory.
  • prctl(PR_SET_SECCOMP) + SECCOMP_RET_LOG (4.14+) — Audit mode: the filter returns SECCOMP_RET_LOG for the syscalls you intend to deny, and the kernel logs them and lets them proceed instead of killing the process. Run your production workload with a restrictive filter in log mode for a week, analyze the logs, then tighten. Iterative seccomp profile development instead of guessing.
  • memfd_secret() — Memory that's removed from the kernel's direct map. Even an attacker with an arbitrary kernel read primitive cannot read this memory without first re-mapping it. Designed for cryptographic key storage. The process sees it as normal mmap'd memory, but kernel paths that go through the direct map, such as /proc/kcore, cannot access it. Hardware cost: TLB pressure from splitting the direct map's huge pages.
  • mseal() (6.10+) — Immutable memory mappings. Once sealed, mprotect, munmap, mremap, and mmap(MAP_FIXED) over the sealed range all fail. Prevents an attacker's ROP chain from calling mprotect(PROT_EXEC) on your data or remapping your code. Contributed by Chrome engineers to harden the browser's address space. The runtime cost is near-zero: it's just a flag on the VMA.
  • prctl(PR_SET_MDWE) (6.3+) — Memory-Deny-Write-Execute. The process can never have a page that's both writable and executable simultaneously. mmap(PROT_WRITE|PROT_EXEC) fails, and mprotect cannot add PROT_EXEC to a mapping that was not already executable, so the classic JIT sequence of writing code and then flipping the page to RX is blocked too. A JIT running under MDWE needs a different design, for example a separate executable mapping of the same memory. Once set, the policy cannot be cleared. systemd exposes it per unit as MemoryDenyWriteExecute=yes.
  • prctl(PR_SET_TAGGED_ADDR_CTRL) with PR_TAGGED_ADDR_ENABLE (ARM64) — Enable the tagged address ABI. The hardware already ignores the top byte of userspace pointers (Top Byte Ignore, TBI); this prctl makes the kernel accept pointers with a non-zero top byte in syscalls. This is the foundation for ARM's Memory Tagging (MTE) and HWASan. Combined with PROT_MTE, you get hardware-enforced memory safety.
  • keyctl() / add_key() / request_key() — The kernel keyring. Store crypto keys in kernel memory (not swappable, not visible in /proc/pid/mem). Keys can have access controls, expiration, and type-specific operations (asymmetric crypto, encrypted, trusted TPM-backed). dm-crypt, eCryptfs, fscrypt all use the keyring. KEYCTL_RESTRICT_KEYRING prevents unauthorized key injection.
  • ioctl(FS_IOC_SET_ENCRYPTION_POLICY) (fscrypt) — Per-directory encryption on ext4/f2fs/UBIFS. Each directory tree gets a different key. When the user locks their phone, the key is evicted from the keyring and the files become undecryptable ciphertext. Unlike full-disk encryption, this gives per-user/per-app key separation. Android and ChromeOS use this for at-rest encryption.
  • fanotify with FAN_OPEN_EXEC_PERM — Intercept and approve/deny every executable load. An anti-malware daemon gets an event before execve completes, can inspect the binary, and return FAN_ALLOW or FAN_DENY. This is Linux's answer to Windows' PsSetCreateProcessNotifyRoutine. ClamAV on-access scanning uses this.
  • ptrace(PTRACE_SECCOMP_GET_FILTER) — Dump the seccomp BPF filter of a traced process. Used by security auditors to verify that a container's seccomp profile actually matches what was intended. Without this, you're trusting that the runtime installed the right filter.
  • SECCOMP_RET_USER_NOTIF + ioctl(SECCOMP_IOCTL_NOTIF_ADDFD) — A supervisor process intercepts a sandboxed process's syscalls and can inject file descriptors into the target. The sandboxed process calls open("/dev/gpu"), the supervisor validates it, opens the real device, and injects the fd. The sandboxed process never touches the real device node. This pattern lets a container manager broker device access for unprivileged containers.
  • seccomp(SECCOMP_SET_MODE_FILTER) — Install a BPF filter on all syscalls for the calling thread. Chrome, Docker, and every serious sandbox uses this. The SECCOMP_RET_USER_NOTIF mode lets a supervisor process make policy decisions in userspace.
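
The PR_SET_NO_NEW_PRIVS step that has to precede an unprivileged seccomp filter can be exercised in a throwaway child process. This is a minimal sketch using ctypes, since Python has no prctl wrapper; the constant values are taken from linux/prctl.h, and _prctl and child_locks_privileges are hypothetical helper names. It also checks that the flag cannot be cleared again.

```python
import ctypes
import os

PR_SET_NO_NEW_PRIVS = 38  # values from linux/prctl.h
PR_GET_NO_NEW_PRIVS = 39

_libc = ctypes.CDLL(None, use_errno=True)


def _prctl(option, arg2=0):
    """Call prctl(option, arg2, 0, 0, 0) and raise OSError on failure."""
    ret = _libc.prctl(ctypes.c_int(option), ctypes.c_ulong(arg2),
                      ctypes.c_ulong(0), ctypes.c_ulong(0), ctypes.c_ulong(0))
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def child_locks_privileges():
    """Fork a child that sets no_new_privs and confirms it is irreversible.

    Returns True if the flag reads back as 1 and clearing it is refused.
    The parent process is left untouched.
    """
    pid = os.fork()
    if pid == 0:  # child: must never return into the caller's code
        status = 1
        try:
            _prctl(PR_SET_NO_NEW_PRIVS, 1)
            flag_set = _prctl(PR_GET_NO_NEW_PRIVS) == 1
            try:
                _prctl(PR_SET_NO_NEW_PRIVS, 0)  # attempt to clear the flag
                cleared = True
            except OSError:
                cleared = False  # expected: the kernel rejects the request
            if flag_set and not cleared:
                status = 0
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0
```

In a real sandbox the child would go on to install its seccomp filter and exec the workload. The fork is used here only so the calling process keeps its original privileges.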

Reliability & Fault Tolerance

  • prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY) — Machine Check Exception handling policy for uncorrected memory errors. By default (late kill), a process learns about a poisoned page only when it touches it, receiving SIGBUS (BUS_MCEERR_AR) at that moment. With PR_MCE_KILL_EARLY, you get SIGBUS (BUS_MCEERR_AO) as soon as the kernel detects the error, even before you access the page. Database engines use this with a SIGBUS handler that marks the affected page as corrupt and initiates recovery from a replica.
  • madvise(MADV_HWPOISON) — Simulate a hardware memory error on a page. Used to test your MCE recovery path without actual bad RAM. Inject poison, verify your SIGBUS handler kicks in, verify your database marks the page bad and recovers. Essential for testing fault tolerance of in-memory databases.
  • MADV_SOFT_OFFLINE — Ask the kernel to migrate data off a page that has correctable ECC errors and take the physical page offline. Proactive repair: monitoring tools watch mcelog/rasdaemon for correctable errors, identify trending pages, and MADV_SOFT_OFFLINE them before they become uncorrectable. Hyperscalers (Google, Facebook) do this at fleet scale.
  • PR_GET_TIMERSLACK / PR_SET_TIMERSLACK — Control timer coalescing granularity. Default is 50μs: the kernel may delay your timer by up to 50μs to batch it with others (saving power). The value is in nanoseconds. For latency-critical paths, set it to 1. For background tasks, set it to 10,000,000 (10ms) to save power and reduce wakeups. Threads under a real-time scheduling policy ignore timer slack. Android sets different slack for foreground vs background apps.
  • ioprio_set(IOPRIO_CLASS_IDLE) — Set I/O scheduling priority. IOPRIO_CLASS_IDLE means "only do I/O when the disk is otherwise idle." Backup processes, log compaction, and background defragmentation should use this so they never interfere with latency-sensitive foreground I/O. Combine with IOPRIO_CLASS_RT for WAL writes that must never be delayed. Priorities are honored only by I/O schedulers that implement them (such as BFQ); with the none scheduler they have no effect.
  • dup3() with O_CLOEXEC — Atomically duplicate an fd with close-on-exec. Without O_CLOEXEC, there's a race between dup2() and fcntl(FD_CLOEXEC) where a concurrent fork+exec leaks the fd to a child. This is a real security bug class — leaked fds to child processes have caused privilege escalation in production systems.
  • pipe2() / socket() / accept4() / epoll_create1() with O_CLOEXEC / SOCK_CLOEXEC — Same pattern. Every fd-creating syscall now has a CLOEXEC variant. Senior engineers use only these variants, never the legacy versions. The window between socket() and fcntl(FD_CLOEXEC) is a textbook fd leak.
  • renameat2(RENAME_EXCHANGE) for crash-safe updates — The pattern: write new data to file.tmp, fsync(file.tmp), fsync(directory), rename(file.tmp, file). But with RENAME_EXCHANGE, you swap file.tmp and file atomically — if you crash mid-swap, you still have the old file at one path and the new file at the other. Strictly better than rename() for crash safety.
  • sync_file_range(SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_BEFORE) — The PostgreSQL trick. Phase 1: SYNC_FILE_RANGE_WRITE (initiate writeback, don't wait). Phase 2: Continue writing more data. Phase 3: SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE (wait for phase 1 to finish, initiate writeback of phase 2). This pipelines fsync — you're always writing the next batch while the previous batch is flushing. 2x WAL throughput on fast SSDs.
  • fcntl(F_OFD_SETLK) — Open file description locks (Linux 3.15+, later standardized in POSIX.1-2024). Unlike fcntl(F_SETLK) locks, which are owned by the process and are all released when any fd referring to the file is closed, OFD locks are owned by the open file description. A multi-threaded program can give each thread its own lock by having each thread open() the file separately. The old POSIX locks were fundamentally broken for multithreaded programs and libraries: closing any fd for the file, even one opened by unrelated code, released every lock the process held on it.
  • pidfd_open() + pidfd_send_signal() + poll() — The complete PID-race-free lifecycle. Open a pidfd, poll it (becomes readable on exit), send signals via pidfd (never hits the wrong process). No TOCTOU between "check if alive" and "send signal." The old kill(pid, 0) check + kill(pid, SIGTERM) pattern has a race where the PID could be recycled between the two calls. In a containerized environment cycling through PIDs rapidly, this is a real bug.
  • close_range(3, ~0U, CLOSE_RANGE_CLOEXEC) — Mark fds 3 through MAX as close-on-exec in one syscall. Alternative to close_range(3, ~0U, 0) (which closes them immediately). Used in security-sensitive fork+exec paths to ensure no fd leaks without the cost of iterating /proc/self/fd.
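
The crash-safe update sequence from the renameat2 entry above (write a temp file, fsync it, rename it over the target, fsync the directory) can be sketched as follows. This is a minimal sketch in Python, whose os module is a thin wrapper over the same syscalls. It uses plain rename() via os.replace() rather than RENAME_EXCHANGE, which the os module does not expose, and atomic_write is a hypothetical helper name.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory: rename() is only atomic
    # within a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())       # 1. data is durable under the temp name
        os.replace(tmp_path, path)       # 2. atomic rename over the target
    except BaseException:
        # Best-effort cleanup of the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)                 # 3. the rename itself is durable
    finally:
        os.close(dir_fd)
```

Skipping step 3 is the usual mistake: the file data is on disk, but the directory entry pointing at it may not be, so the update can vanish after a power loss.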

Direct Hardware Access

  • iopl(3) / ioperm() — Grant userspace access to x86 I/O ports. Level 3 gives access to all ports. DPDK's legacy PCI driver mode uses this. You can directly read/write NIC registers, program DMA descriptors, and handle completions — all without entering the kernel. Requires CAP_SYS_RAWIO. Replaced by VFIO in modern stacks but still used for legacy industrial control hardware (PLCs, custom FPGA boards).
  • mmap(/dev/mem) — Map physical memory into userspace. Used for MMIO access to hardware registers. Embedded developers use this to poke FPGA registers. On servers, it's used for accessing ACPI tables, BIOS regions, and PCIe BARs of devices without a kernel driver. With CONFIG_STRICT_DEVMEM=y only non-RAM regions (PCI space, BIOS areas) can be mapped, and CONFIG_IO_STRICT_DEVMEM additionally blocks ranges claimed by a kernel driver. The kernel-virtual counterpart, /dev/kmem, was removed in Linux 5.13.
  • VFIO_SET_IOMMU + VFIO_IOMMU_MAP_DMA + VFIO_DEVICE_SET_IRQS — Full userspace device driver framework. Map a PCI device's BARs into userspace, set up DMA mappings via the IOMMU, and configure MSI-X interrupts delivered via eventfd. DPDK (NICs), SPDK (NVMe), and GPU passthrough (QEMU/KVM) all use VFIO. The IOMMU ensures DMA isolation: the device can only DMA to pages you've explicitly mapped.
  • /dev/vhost-net + ioctl(VHOST_SET_MEM_TABLE) — Kernel-side virtio data plane. Set up virtqueues and memory mappings, then the kernel handles packet forwarding between a VM's virtio NIC and the host network stack without any userspace involvement. QEMU sets this up and then the data path is pure kernel. 2-5x throughput improvement over userspace virtio.
  • ioctl(PERF_EVENT_IOC_SET_BPF) on a PMU event — Attach an eBPF program to a hardware performance counter overflow. Every N cache misses (or branch mispredictions, or TLB misses), your eBPF program runs with access to the instruction pointer, registers, and stack. Build custom profilers that collect application-specific context (request ID, tenant ID) alongside hardware events.
  • ioctl(KVM_CREATE_VM) + KVM_SET_USER_MEMORY_REGION + KVM_RUN — Build a hypervisor in userspace. Create a VM, map its physical memory (backed by your process's virtual memory), load code, and run it. KVM_RUN returns on VMEXIT (I/O, HLT, etc.), your code handles the exit, and you re-enter. Firecracker (AWS Lambda) is <50K lines of Rust doing exactly this. You can run a minimal VM in ~1000 lines of C.
  • /dev/uio* (Userspace I/O) — Map a device's MMIO registers and get interrupt notifications via read(). Simpler than VFIO but no IOMMU protection (device can DMA anywhere). Used for simple FPGA boards where you trust the hardware. Write to the mmap'd region = write to the device register. Read on the fd = block until interrupt.
  • pkey_alloc() / pkey_mprotect() / pkey_free() — Memory Protection Keys (Intel MPK; also implemented on recent AMD CPUs). Partition your address space into up to 16 "domains" with per-thread read/write permissions. Change permissions with a single unprivileged register write (WRPKRU, on the order of tens of cycles) instead of mprotect (syscall, page-table update, TLB flush). Use cases: per-thread isolation (each request handler can only access its own data), guard pages without TLB cost, and sandboxing untrusted code within the same address space.
  • io_uring with IORING_REGISTER_NAPI (6.x+) — Register NAPI (NIC polling) with io_uring. When waiting for network completions, io_uring busy-polls the NIC directly, bypassing softirq. Combines the latency benefits of SO_BUSY_POLL with io_uring's batching model. AWS ENA and Intel ice drivers support this.
  • /dev/udmabuf (UDMABUF_CREATE ioctl) — Create a DMA-buf backed by memfd memory (the memfd must carry the F_SEAL_SHRINK seal). Allows zero-copy sharing between userspace, the GPU, and other DMA-capable devices. Used in graphics compositing (Wayland) and ML inference pipelines where you want to pass tensors between the CPU and an accelerator without copies.
  • perf_event_open() with PERF_TYPE_RAW — Access model-specific performance counters (Intel PEBS, AMD IBS). Things like: exact instruction that caused an L3 miss, load latency histograms, memory access source (L1/L2/L3/local DRAM/remote DRAM). This is how perf mem and Intel VTune work under the hood.
  • perf_event_open() with PERF_SAMPLE_BRANCH_STACK (LBR) — Capture the Last Branch Record — a hardware trace of the last ~32 branches taken. Gives you a mini call-stack trace for free (no frame pointers needed). Used for AutoFDO (feedback-directed optimization) — profile in production, feed branch traces to the compiler.

Networking

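  • sendmsg() with MSG_ZEROCOPY — True zero-copy send. Enable it once with setsockopt(SO_ZEROCOPY), then pass MSG_ZEROCOPY on each send. The kernel pins your userspace pages and the NIC DMAs directly from them, so the buffer must not be modified or reused until the completion notification arrives via recvmsg(MSG_ERRQUEUE). Pinning and notification have a fixed cost; the kernel documentation puts the break-even at writes of roughly 10 KB and larger. Used for 10G+ network throughput.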
  • SO_BUSY_POLL / SO_PREFER_BUSY_POLL — Spin-poll the NIC driver directly from epoll_wait/recvmsg, bypassing softirq. Trades CPU for ~50% latency reduction. Used in HFT and latency-sensitive services.
  • BPF_PROG_TYPE_XDP (via bpf()) — Attach eBPF programs that run at the NIC driver level before the kernel allocates an sk_buff. Can drop, redirect, or modify packets at line rate. Used by Cloudflare's DDoS mitigation and Facebook's L4 load balancer (Katran).
  • SO_REUSEPORT + BPF_PROG_TYPE_SK_REUSEPORT — Multiple sockets on same port with eBPF-controlled routing. Perfect consistent hashing for load balancing across worker threads without a dispatch bottleneck.
  • TCP_FASTOPEN — Send data in the SYN packet. Saves one RTT on connection establishment. Requires a TFO cookie from a prior connection.
  • SCM_RIGHTS via sendmsg() — Pass file descriptors between processes over Unix domain sockets. The mechanism behind graceful restarts in Envoy, Nginx, and HAProxy (hand off live connections to a new process).
  • AF_XDP sockets — Kernel-bypass networking. Userspace reads/writes packets directly from NIC ring buffers via shared UMEM. Near-DPDK performance without leaving the kernel's control plane.
  • TCP_ULP (Upper Layer Protocol) — Plug a protocol handler into the TCP stack. setsockopt(TCP_ULP, "tls") offloads TLS to the kernel (kTLS). The kernel encrypts/decrypts in-place, and you can sendfile() encrypted data — zero-copy TLS. Nginx and HAProxy use this for ~30% TLS throughput improvement.
  • SO_INCOMING_CPU — Query which CPU received data for this socket. Combined with SO_REUSEPORT + eBPF, lets you build a NUMA-aware network stack where each core processes packets from its local NIC queue through to the application without cross-core traffic.
  • TCP_REPAIR — Put a TCP socket into "repair mode" where you can get/set the full internal state (sequence numbers, window sizes, congestion state). Used by CRIU to checkpoint and restore live TCP connections across process migration. Also used for live migration of VMs and containers without dropping connections.
  • SO_TXTIME / SCM_TXTIME (ETF scheduler) — Schedule packet transmission at a precise future timestamp. The kernel holds the packet and sends it at the specified time. Used in Time-Sensitive Networking (TSN) for industrial control, audio/video production, and automotive ethernet. Requires hardware support (NIC with time-based scheduling).
  • SO_TIMESTAMPING (hardware timestamps) — Get nanosecond-precision timestamps from the NIC hardware. SOF_TIMESTAMPING_TX_HARDWARE tells you exactly when a packet hit the wire, not when the kernel queued it. PTP (Precision Time Protocol) and HFT use this. Combined with SO_SELECT_ERR_QUEUE, you can get TX timestamps asynchronously via epoll.
  • PACKET_FANOUT with PACKET_FANOUT_EBPF — Distribute raw packets across a group of AF_PACKET sockets using an eBPF program for routing decisions. Used by network monitoring tools (Suricata, Zeek) to load-balance packet processing across cores with custom flow affinity.
  • TCP_NOTSENT_LOWAT — Set the threshold of unsent data at which the socket becomes writable. Default behavior: epoll reports writable when the send buffer has space. With this, it reports writable only when unsent data drops below your threshold. Reduces bufferbloat and memory usage for write-heavy workloads. Apple invented this for macOS, Linux adopted it in 3.12.
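
To make the SCM_RIGHTS entry above concrete, here is a minimal sketch of passing a file descriptor over a Unix domain socket. It assumes Python 3.9+, where socket.send_fds() and socket.recv_fds() wrap sendmsg()/recvmsg() with an SCM_RIGHTS control message. Both ends live in one process purely to keep the example self-contained; in a graceful restart they would be the old and new server processes. pass_fd_demo is a hypothetical name.

```python
import os
import socket


def pass_fd_demo() -> bytes:
    """Send a pipe's write end across a Unix socketpair and write through the received copy."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    read_end, write_end = os.pipe()
    try:
        # On a stream socket, at least one byte of ordinary data must
        # accompany the SCM_RIGHTS control message.
        socket.send_fds(sender, [b"fd"], [write_end])
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 1024, 1)
        received_fd = fds[0]    # new fd number, same underlying open file description
        os.close(write_end)     # the sender's copy can be closed; the passed copy keeps the pipe open
        os.write(received_fd, b"hello via passed fd")
        os.close(received_fd)
        return os.read(read_end, 1024)
    finally:
        os.close(read_end)
        sender.close()
        receiver.close()
```

The receiver gets its own descriptor number referring to the same open file description, which is why a listening socket handed over this way keeps its accept queue intact.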

cgroups & Resource Control (v2)

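  • cpu.pressure / memory.pressure / io.pressure (PSI, Pressure Stall Information) — Read some and full stall figures for CPU, memory, and I/O per cgroup: the share of wall-clock time stalled, averaged over 10 s, 60 s, and 300 s windows, plus a cumulative total in microseconds. some = at least one task stalled. full = all non-idle tasks stalled at once. You can poll() on these files after writing a trigger threshold ("notify me when memory full stall exceeds 100ms in any 1s window"). Android's lmkd and systemd-oomd use this for proactive OOM killing before the system falls over. The similarly named cgroup.pressure file is only the on/off switch for PSI accounting in a cgroup.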
  • memory.reclaim (5.19+) — Write a byte count to proactively reclaim memory from a cgroup. Unlike MADV_PAGEOUT (which is per-VMA), this works on the entire cgroup. Kubernetes node agents use this for graceful memory pressure relief instead of waiting for the kernel's OOM killer.
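  • cpu.max.burst — Allow a cgroup to bank unused CPU quota from earlier periods, up to the configured burst amount, and spend it to run above its quota temporarily. Solves the classic CFS bandwidth throttling problem in which a microservice whose average usage is well under its quota still gets throttled, because a short burst of work needs more CPU than a single period's quota allows.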
  • io.latency — Set latency targets per cgroup for block I/O. The kernel dynamically throttles other cgroups to meet your target. Unlike io.max (hard limits), this is workload-adaptive. Used for SSD-backed databases where you want latency SLOs, not throughput caps.
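
As an illustration of the PSI entry above, the sketch below parses the text format of a pressure file and builds the trigger string for the "100ms in any 1s window" example. It is a minimal sketch: parse_psi and psi_trigger are hypothetical helper names, and the sample text in the usage note is made up to match the documented format.

```python
def parse_psi(text: str) -> dict:
    """Parse the contents of a *.pressure file into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            "avg10": float(values["avg10"]),    # % of time stalled, 10 s average
            "avg60": float(values["avg60"]),
            "avg300": float(values["avg300"]),
            "total_us": int(values["total"]),   # cumulative stall time in microseconds
        }
    return result


def psi_trigger(kind: str, stall_ms: int, window_ms: int) -> bytes:
    """Build the trigger written to a pressure file before poll()ing it for POLLPRI."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    # Format: "<some|full> <stall threshold in us> <window in us>".
    # The trailing NUL mirrors the kernel documentation's example, which
    # writes strlen(trigger) + 1 bytes.
    return f"{kind} {stall_ms * 1000} {window_ms * 1000}".encode() + b"\0"
```

Typical use: open memory.pressure with O_RDWR | O_NONBLOCK, write psi_trigger("full", 100, 1000) to it, then poll() the fd for POLLPRI. The kernel accepts window sizes between 500 ms and 10 s.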

eBPF

  • bpf() (eBPF) — The meta-syscall. Attach sandboxed programs to kprobes, tracepoints, cgroup hooks, socket filters, LSM hooks, scheduler hooks. Effectively lets you extend the kernel at runtime.
  • BPF_MAP_TYPE_RINGBUF — Multi-producer, single-consumer, variable-length ring buffer shared between kernel eBPF and userspace. Replaces BPF_MAP_TYPE_PERF_EVENT_ARRAY with better performance (one buffer shared across CPUs instead of per-CPU waste, event ordering preserved, no lost events with proper backpressure). The modern way to stream events from kernel to userspace.
  • BPF_PROG_TYPE_STRUCT_OPS — Replace kernel struct_ops callbacks with eBPF programs. You can replace the TCP congestion control algorithm at runtime by implementing struct tcp_congestion_ops in eBPF. Netflix wrote a custom congestion controller this way. You can also replace the scheduler's task selection logic (sched_ext).
  • sched_ext (via eBPF struct_ops) (6.12+) — Write your scheduler in eBPF. The kernel calls your eBPF program to make scheduling decisions (which task runs on which core). Meta uses scx_rusty for cache-aware scheduling in production. You can implement gang scheduling, latency-optimized scheduling, or ML-driven scheduling without recompiling the kernel.
  • BPF_MAP_TYPE_ARENA (6.9+) — Shared memory arena between eBPF and userspace at the same virtual address. Pointers work across both. Enables complex data structures (linked lists, trees) shared between kernel and userspace without serialization. A game-changer for high-performance observability.
  • fentry/fexit eBPF program types — Attach to kernel function entry/exit with zero overhead when not attached (uses trampolines, not breakpoints). Replaces kprobes for most use cases with 5-10x less overhead. You get typed access to function arguments via BTF.

Namespace & Container Internals

  • setns(fd, CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | ...) — Enter an existing namespace. A monitoring agent can setns into a container's network namespace, run diagnostics, and setns back. Combined with pidfd_open + pidfd_getfd, you can inspect a container's state without docker exec.
  • pivot_root() — Atomically swap the root filesystem. Unlike chroot (which is trivially escapable with fchdir on an open fd), pivot_root puts the old root inside the new root, then you umount it. This is how every container runtime sets up the root filesystem. chroot is not a security boundary; pivot_root + mount namespace is.
  • mount(..., MS_SHARED / MS_SLAVE / MS_PRIVATE) — Mount propagation (the types are mutually exclusive; pass one per call). MS_SHARED means mount and unmount events propagate in both directions between peer mounts, including across mount namespaces. MS_SLAVE means propagation is one-way (host→container but not container→host). MS_PRIVATE means no propagation. Getting this wrong in a container runtime means either mounts leak between containers (security) or USB devices don't appear inside containers (usability).
  • clone3() with CLONE_INTO_CGROUP — Spawn a process directly into a cgroup atomically. Without this, you clone() then write(pid, cgroup.procs), and there's a window where the child runs in the parent's cgroup, potentially consuming resources from the wrong budget. Kubernetes and systemd use CLONE_INTO_CGROUP to eliminate this race.
  • unshare(CLONE_NEWUSER) + write uid_map/gid_map — Create a user namespace where you're root (UID 0 maps to your real UID). Inside, you can create other namespaces, mount filesystems, configure networking — all without real root. This is the entire foundation of rootless Podman and rootless Docker. The uid_map write must happen before any privilege-requiring operation in the child namespace.
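  • unshare(CLONE_NEWCGROUP | CLONE_NEWUSER | ...) — Create new namespaces for the calling process without forking. After unshare(CLONE_NEWUSER) the caller holds a full capability set inside the new user namespace and can go on to create mount, network, and other namespaces owned by it. Two sharp edges: CLONE_NEWUSER fails in a multi-threaded process, and there is no way back, because rejoining the parent user namespace with setns() requires CAP_SYS_ADMIN there, which an unprivileged caller does not have. Rootless runtimes (Podman, user-namespaced Docker) build on this.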
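
When debugging the mount propagation mistakes described above, it helps to see what the kernel actually recorded for each mount. The sketch below classifies one line of /proc/self/mountinfo by its optional fields (shared:N, master:N, unbindable). Using mountinfo for this is an assumption of the example, not something the entries above prescribe, and propagation_type is a hypothetical helper name.

```python
def propagation_type(mountinfo_line: str) -> str:
    """Classify a /proc/self/mountinfo line as shared, slave, shared+slave, private, or unbindable."""
    fields = mountinfo_line.split()
    # Fields 0-5 are fixed (mount ID, parent ID, major:minor, root, mount point,
    # mount options). Zero or more optional fields follow, ended by a lone "-".
    separator = fields.index("-")
    tags = {field.split(":", 1)[0] for field in fields[6:separator]}
    if "unbindable" in tags:
        return "unbindable"
    if "shared" in tags and "master" in tags:
        return "shared+slave"   # receives from a master and propagates to its own peers
    if "shared" in tags:
        return "shared"         # member of a peer group (MS_SHARED)
    if "master" in tags:
        return "slave"          # receives events from a master peer group (MS_SLAVE)
    return "private"            # no optional propagation fields (MS_PRIVATE)
```

For example, a line whose optional field is master:1 is a slave mount: events arrive from peer group 1 but nothing propagates back.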

Advanced Process & Thread Control

  • clone3() with CLONE_CLEAR_SIGHAND — Clear all signal handlers atomically at fork. Without this, you fork and have a window where the child inherits a SIGTERM handler that calls exit() on data structures it shouldn't touch. Container runtimes and process supervisors need this.
  • prctl(PR_SET_CHILD_SUBREAPER) — Make this process the reaper for all orphaned descendants, not just direct children. systemd, Docker, and any process supervisor use this so double-forked grandchildren don't escape to PID 1. Without it, you get zombie leaks in containers.
  • prctl(PR_SET_PDEATHSIG) — Send a signal when the parent dies. Sounds simple, but the subtlety is it tracks the thread that called clone, not the process. If that thread exits but the process lives, your child gets a spurious death signal. Senior engineers know to call this after fork and verify getppid() hasn't already changed (race window).
  • waitid(P_PIDFD, ...) — Wait on a pidfd instead of a PID. Completely race-free — no TOCTOU between "is this process alive" and "wait for it." Combined with CLONE_PIDFD in clone3(), you get a fully race-free process lifecycle.
  • prctl(PR_SET_SYSCALL_USER_DISPATCH) (5.11+) — Redirect syscalls to a userspace handler for a given address range. Windows binary emulation (Wine/Proton) uses this — when Windows code makes a syscall, instead of going to Linux's syscall table, it gets dispatched to Wine's NT kernel emulation layer. Also used for syscall interposition without ptrace overhead.

Signals & Error Handling

  • signalfd() / signalfd4() — Convert signals into file descriptor events. Block the signals with sigprocmask() (or pthread_sigmask()) first, otherwise their normal dispositions still apply, then read them via read()/epoll instead of async handlers. Eliminates the entire class of async-signal-safety bugs. Every well-written event loop (systemd, s6) uses this.
  • prctl(PR_MCE_KILL, PR_MCE_KILL_EARLY) — Get SIGBUS early when a memory page has a hardware (ECC) error, instead of the kernel silently killing you later. Database engines use this to detect hardware memory corruption and fail fast on the affected page rather than silently corrupting data.

Misc

  • execveat(fd, "", argv, envp, AT_EMPTY_PATH) — Execute a program by file descriptor instead of path. The binary can be unlinked (deleted) — you're running directly from an open fd. Used for memfd-based execution: memfd_create → write binary to memfd → execveat(memfd_fd). No file ever touches disk. Container runtimes use this to avoid TOCTOU between security scanning a binary and executing it.
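  • copy_file_range() with cross-filesystem support (5.3+) — Server-side copy between different filesystems via splice internally. Before 5.3, this only worked within the same filesystem. From 5.3 the kernel does an optimized in-kernel copy that avoids the userspace read()+write() loop. Since 5.19, cross-filesystem copies are again limited to filesystems of the same type that implement the operation; other combinations fail with EXDEV and the caller must fall back to read()+write(). For NFS→NFS copies, the server can do it without data touching the client at all.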
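  • name_to_handle_at() + open_by_handle_at() — File handles survive rename/move. Get a handle (an opaque, filesystem-defined blob that typically encodes the inode number and generation; the mount ID is returned alongside it), store it, and later open_by_handle_at() even if the file moved. open_by_handle_at() requires CAP_DAC_READ_SEARCH. NFS servers use this as their "file ID." Also usable for a lightweight file-change detection system: if the generation number changes, the inode was recycled, and the stale handle fails to open with ESTALE.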
  • eventfd() + EFD_SEMAPHORE — Kernel-backed semaphore accessible via file descriptor. write(fd, &val) increments by val. read(fd) decrements by 1 (with EFD_SEMAPHORE) or drains to 0 (without). Poll-able via epoll. Used for cross-process synchronization, VFIO interrupt delivery, and io_uring event notification. The kernel fast-path is ~50ns.
  • timerfd_create(CLOCK_BOOTTIME_ALARM) — Timer that wakes the system from suspend. Your monitoring daemon can suspend the machine, and the timer fires → machine wakes → daemon runs health check → suspends again. Used in IoT and embedded systems for periodic sensor polling with minimal power.
  • getrandom(buf, len, GRND_INSECURE) (5.6+) — Fast random bytes without blocking, even if the entropy pool isn't fully initialized. Returns potentially predictable bytes during early boot. For non-cryptographic uses (hash seed randomization, ASLR, load balancing), this is fine and avoids the "my service blocks for 30 seconds at boot waiting for entropy" problem that plagues headless VMs.
  • ioctl(BTRFS_IOC_SEND) + btrfs receive — Stream a btrfs snapshot delta as a byte stream. Given a parent snapshot and a newer one, the kernel computes the diff and writes a stream of commands (create file, write range, clone extent, set permissions). The receiving side is userspace: btrfs receive replays the stream, and there is no BTRFS_IOC_RECEIVE ioctl. Pipe it over SSH for incremental backup. Used by btrbk and every serious btrfs backup tool. The diff is computed from metadata trees, not by reading file contents: O(changes), not O(data).
  • watch_mount() / watch_sb() (proposed, never merged in this form) — Subscribe to mount table and superblock changes via a notification pipe. Intended to replace polling /proc/self/mountinfo. Container runtimes currently poll() on that file and re-parse it on every change, which is O(n) in the number of mounts: on a host with 10K containers and 100K mounts, this is measurably expensive.
  • close_range() — Close a range of file descriptors in one syscall. Essential for secure process spawning — closing fds 3..MAX one-by-one was a measurable bottleneck in fork-heavy workloads.
  • membarrier() — Issue memory barriers in remote threads without IPI in the common case. Used by userspace RCU implementations (liburcu) to avoid expensive smp_mb() on the read side.
  • landlock_create_ruleset() / landlock_add_rule() / landlock_restrict_self() — Unprivileged sandboxing. A process can restrict its own filesystem access without root. The modern alternative to chroot/seccomp for filesystem isolation.
  • mount_setattr() / open_tree() / move_mount() / fsopen() / fsmount() — The new mount API (5.2+). Atomic, race-free, composable. Container runtimes are migrating from the old mount() syscall to these.
  • io_uring + IORING_OP_URING_CMD — Pass custom commands through io_uring to drivers (NVMe passthrough, network zero-copy). Effectively makes io_uring an extensible syscall interface.
  • listmount() / statmount() (6.8+) — Iterate and query mount table entries by ID instead of parsing /proc/mounts. O(1) lookup by mount ID vs O(n) text parsing. Container runtimes that manage thousands of mounts care about this.
  • mseal() (6.10+) — Seal memory mappings so they can't be changed (mprotect, munmap, mremap all fail). A mitigation against ROP/JOP attacks that mprotect writable→executable. Chrome's V8 and other JIT engines use this to lock down JIT pages after compilation.
  • map_shadow_stack() (6.6+) — Allocate a shadow stack page for Intel CET (Control-flow Enforcement Technology). The CPU maintains a parallel return-address stack in write-protected memory. If a ROP gadget overwrites the real return address, it won't match the shadow stack and the CPU faults. Glibc 2.39+ enables this automatically on supported hardware.
  • futex_waitv() (5.16+) — Wait on multiple futexes simultaneously. Windows has WaitForMultipleObjects; Linux never had an equivalent until this. Wine/Proton needed this desperately for game compatibility — many Windows games wait on multiple synchronization objects at once.

Exotic / Niche

  • kcmp(KCMP_EPOLL_TFD) — Check if a specific target fd is monitored by a specific epoll instance, in another process. Checkpoint/restore (CRIU) needs this to reconstruct the exact epoll topology of a process tree.
  • process_vm_readv() / process_vm_writev() — Read/write another process's memory in a single syscall (no ptrace attach/detach overhead, no /proc/pid/mem open). Debuggers and profilers use this for low-overhead memory inspection. Combined with pidfd_getfd(), you can inspect a process's fds and memory atomically.
  • timerfd_create() with TFD_TIMER_CANCEL_ON_SET — Get notified when the system clock is adjusted (NTP step, settimeofday). Critical for applications that maintain time-based data structures — if the clock jumps forward 30 seconds, your timer wheel needs to know. Without this flag, your timers silently fire at the wrong time.
  • fanotify with FAN_REPORT_PIDFD | FAN_REPORT_FID — Filesystem-wide event monitoring that gives you pidfds (race-free process identification) and file handles (not paths — survives renames). Used by malware scanners, HSM (hierarchical storage), and audit systems. The FAN_MODIFY + FAN_REPORT_FID combo lets you build a change journal like NTFS's USN journal on Linux.
  • iopl() / ioperm() — Grant a userspace process direct access to x86 I/O ports. Used by DPDK's legacy mode to talk to NICs without kernel drivers. Extremely dangerous — you can brick hardware. Most people use UIO/VFIO instead now, but these still exist for legacy PCI devices.

Philosophy

The unifying theme across all of these: the default kernel behavior is optimized for the general case, and these syscalls exist to let you override that default when you know better. The kernel assumes you want page cache (O_DIRECT says you don't). The kernel assumes interrupt-driven I/O (SQPOLL/busy-poll says spin instead). The kernel assumes uniform memory (mbind says NUMA-aware). The kernel assumes you trust all your own threads (pkeys says isolate them).

Every one of these is a bet: "I know my workload better than the kernel's heuristics." The senior engineer's skill is knowing when that bet pays off and when the kernel's defaults are smarter than you think.

The real expertise isn't just knowing these exist — it's knowing when to reach for them. A senior engineer knows that MADV_HUGEPAGE can cause latency spikes from compaction, that MSG_ZEROCOPY has a crossover point below which the copy is faster, that SQPOLL burns a CPU but eliminates syscall overhead, and that rseq is useless if your workload already pins threads to cores.



See Also

  • io_uring Internals — Deep dive into io_uring architecture referenced in the storage and profiling sections
  • VFIO Internals — VFIO/IOMMU device passthrough referenced in the hardware access section
  • ISA Critical Instructions — Hardware instructions (rseq, memory barriers, pkeys) underlying many syscalls listed here
  • Filesystem Design — Filesystem internals behind O_DIRECT, fallocate, reflink, and sync_file_range