How Containers Actually Work — Namespaces, Cgroups, and chroot
Docker isn't magic — here's how to build a container with just Linux commands. Containers are nothing more than regular Linux processes with three layers of isolation: namespaces (what a process can see), cgroups (what a process can use), and a changed root filesystem (where a process lives). Once you understand these primitives, Kubernetes networking, Docker storage drivers, and container security all start making sense.
The Three Pillars of Containers
| Primitive | Controls | Question It Answers |
|---|---|---|
| Namespaces | Visibility | What can the process see? |
| Cgroups | Resources | How much CPU/memory can it use? |
| chroot / pivot_root | Filesystem | What filesystem does it see? |
Let's build each layer from the ground up.
Namespaces: Isolating What a Process Can See
As of kernel 5.6, Linux has eight namespace types. Each one isolates a different aspect of the system.
| Namespace | Flag | Isolates |
|---|---|---|
| PID | CLONE_NEWPID | Process IDs |
| NET | CLONE_NEWNET | Network stack |
| MNT | CLONE_NEWNS | Mount points |
| UTS | CLONE_NEWUTS | Hostname |
| IPC | CLONE_NEWIPC | Inter-process communication |
| USER | CLONE_NEWUSER | User/group IDs |
| Cgroup | CLONE_NEWCGROUP | Cgroup root |
| Time | CLONE_NEWTIME | System clocks (kernel 5.6+) |
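Every namespace a process belongs to appears as a symlink under /proc/<pid>/ns, and the inode number in the link target identifies the namespace instance. The following is a minimal sketch for listing them without root. The helper name list_ns and its optional directory argument are illustrative, not part of any existing tool.

```shell
# list_ns [DIR]: print "type target" for every namespace link in DIR.
# DIR defaults to /proc/self/ns; pass e.g. /proc/1234/ns to inspect
# another process (reading other users' processes may need root).
list_ns() {
  local dir="${1:-/proc/self/ns}" link
  for link in "$dir"/*; do
    # Skip the unexpanded glob if the directory is empty or missing
    [ -L "$link" ] || continue
    printf '%s %s\n' "$(basename "$link")" "$(readlink "$link")"
  done
}

list_ns
```

Two processes that print the same target for a given type share that namespace.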
The unshare command lets you create new namespaces from the command line.
PID Namespace: A New Process Tree
# Create a new PID namespace and run bash inside it
sudo unshare --pid --fork --mount-proc bash
# Inside the new namespace:
ps aux
# PID 1 is now your bash process — it can't see host processes
# Check from the host (in another terminal):
ps aux | grep unshare
# The namespaced process still has a real PID on the host
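The host PID and the in-namespace PID are two views of the same process. On Linux 4.1 and later, the kernel reports both in the NSpid line of /proc/<pid>/status, outermost namespace first. A small sketch for reading that line follows; the helper name nspid_of is hypothetical.

```shell
# nspid_of STATUS_FILE: print the PIDs from the NSpid line,
# outermost namespace first, one per line.
nspid_of() {
  awk '$1 == "NSpid:" { for (i = 2; i <= NF; i++) print $i }' "$1"
}

# Show the PID(s) of the current shell, if /proc is available
if [ -r "/proc/$$/status" ]; then
  nspid_of "/proc/$$/status"
fi
```

Run from the host against the bash started by unshare, this should print two values: the host PID followed by 1.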
UTS Namespace: Custom Hostname
# Create a new UTS namespace with a custom hostname
sudo unshare --uts bash
# Change hostname — only affects this namespace
hostname my-container
hostname
# Output: my-container
# Check from host — hostname is unchanged
NET Namespace: Isolated Network Stack
This is the foundation of how Docker and Kubernetes networking works.
# Create a named network namespace
sudo ip netns add mycontainer
# List namespaces
ip netns list
# Run a command in the namespace
sudo ip netns exec mycontainer ip addr
# Only loopback, no external connectivity
# Create a veth pair (virtual ethernet cable)
sudo ip link add veth-host type veth peer name veth-container
# Move one end into the namespace
sudo ip link set veth-container netns mycontainer
# Configure IP addresses
sudo ip addr add 10.200.0.1/24 dev veth-host
sudo ip link set veth-host up
sudo ip netns exec mycontainer ip addr add 10.200.0.2/24 dev veth-container
sudo ip netns exec mycontainer ip link set veth-container up
sudo ip netns exec mycontainer ip link set lo up
# Test connectivity
sudo ip netns exec mycontainer ping -c 2 10.200.0.1
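Docker's bridge networking builds on the same mechanism: a veth pair links the container's network namespace to the host. The difference is that Docker attaches the host end to a bridge (docker0 by default) instead of assigning it an IP address directly, so many containers can share one subnet.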
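To get closer to a bridge-based setup, the host end of the veth pair can be attached to a Linux bridge that holds the gateway address. This is a sketch under stated assumptions: the bridge name br-demo, the run wrapper, and the DRY_RUN switch are all illustrative. By default it only prints the ip commands it would execute.

```shell
# BRIDGE is a hypothetical bridge name.
# DRY_RUN=1 (the default here) prints commands instead of running them.
BRIDGE="${BRIDGE:-br-demo}"
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    sudo "$@"
  fi
}

# attach_to_bridge HOST_IF GATEWAY_CIDR
attach_to_bridge() {
  local host_if="$1" gateway_cidr="$2"
  run ip link add name "$BRIDGE" type bridge
  run ip link set "$BRIDGE" up
  # The gateway address moves from the veth end to the bridge
  run ip addr del "$gateway_cidr" dev "$host_if"
  run ip addr add "$gateway_cidr" dev "$BRIDGE"
  # Attach the host end of the veth pair to the bridge
  run ip link set "$host_if" master "$BRIDGE"
}

attach_to_bridge veth-host 10.200.0.1/24
```

Set DRY_RUN=0 to apply the commands. Additional containers would then only need their own veth pair attached to the same bridge.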
Cgroups: Limiting What a Process Can Use
Cgroups (control groups) enforce resource limits. Most modern distros use cgroups v2.
# Check if cgroups v2 is active
stat -fc %T /sys/fs/cgroup/
# Output: cgroup2fs (v2) or tmpfs (v1)
# View the cgroup hierarchy
ls /sys/fs/cgroup/
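Before creating child cgroups, it helps to confirm that the controllers you plan to use are enabled for children. On cgroups v2, a file such as memory.max only appears in a child cgroup if memory is listed in the parent's cgroup.subtree_control. A minimal check might look like this; the helper name has_controller is hypothetical.

```shell
# has_controller NAME [FILE]: succeed if NAME is listed in FILE.
# FILE defaults to the root cgroup's subtree_control on a cgroups v2 host.
has_controller() {
  local name="$1" file="${2:-/sys/fs/cgroup/cgroup.subtree_control}"
  [ -r "$file" ] || return 1
  tr ' ' '\n' < "$file" | grep -qx "$name"
}

for c in memory cpu pids; do
  if has_controller "$c"; then
    echo "$c: enabled for child cgroups"
  else
    echo "$c: not enabled (or not a cgroups v2 host)"
  fi
done
```

If a controller is missing, it can usually be enabled with: echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control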
Creating a Cgroup and Setting Limits
# Create a new cgroup
sudo mkdir /sys/fs/cgroup/mycontainer
# Set memory limit to 256MB
echo $((256 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/mycontainer/memory.max
# Set CPU limit to 50% of one core (50000 out of 100000 microseconds)
echo "50000 100000" | sudo tee /sys/fs/cgroup/mycontainer/cpu.max
# Set CPU weight (relative priority, default is 100)
echo 50 | sudo tee /sys/fs/cgroup/mycontainer/cpu.weight
# Set PID limit (max number of processes)
echo 64 | sudo tee /sys/fs/cgroup/mycontainer/pids.max
# Add the current shell to this cgroup
echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs
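The values written above are plain arithmetic: memory.max takes bytes, and cpu.max takes a quota and a period in microseconds. The helpers below are an illustrative sketch (the names mib_to_bytes and cpu_max_from_percent are made up for this example) that derives both from friendlier units.

```shell
# mib_to_bytes MIB: byte value suitable for memory.max
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# cpu_max_from_percent PERCENT [PERIOD_US]: "quota period" string for cpu.max.
# PERCENT is relative to one core, so 150 means one and a half cores.
cpu_max_from_percent() {
  local percent="$1" period="${2:-100000}"
  echo "$(( $percent * $period / 100 )) $period"
}

mib_to_bytes 256           # 268435456
cpu_max_from_percent 50    # 50000 100000
```

For example, cpu_max_from_percent 25 | sudo tee /sys/fs/cgroup/mycontainer/cpu.max would cap the cgroup at a quarter of one core.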
Verifying Cgroup Limits
# Check current memory usage of the cgroup
cat /sys/fs/cgroup/mycontainer/memory.current
# Check if any OOM kills have occurred
cat /sys/fs/cgroup/mycontainer/memory.events
# Check CPU usage statistics
cat /sys/fs/cgroup/mycontainer/cpu.stat
| Resource File | Controls | Example Value |
|---|---|---|
| memory.max | Hard memory limit | 268435456 (256MB) |
| memory.high | Throttle threshold | 209715200 (200MB) |
| cpu.max | CPU bandwidth | 50000 100000 (50%) |
| cpu.weight | Relative CPU shares | 50 (half of default) |
| pids.max | Process count limit | 64 |
| io.max | Disk I/O bandwidth | 8:0 rbps=1048576 |
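The memory.events file is a list of "key value" counters, which makes it easy to script against. Here is a small sketch for pulling out a single counter, such as oom_kill. The helper name event_count is hypothetical, and the path assumes the mycontainer cgroup created earlier.

```shell
# event_count FILE KEY: print the counter for KEY from a
# memory.events-style file ("key value" per line).
event_count() {
  awk -v key="$2" '$1 == key { print $2 }' "$1"
}

events=/sys/fs/cgroup/mycontainer/memory.events
if [ -r "$events" ]; then
  echo "OOM kills so far: $(event_count "$events" oom_kill)"
fi
```

A rising high counter means the cgroup is being throttled at memory.high, while a rising oom_kill counter means processes are being killed at memory.max.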
chroot: Changing the Root Filesystem
The final piece — give the process its own filesystem view.
# Create a minimal root filesystem using Alpine
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer
# Download Alpine Linux mini root filesystem
curl -o alpine-rootfs.tar.gz https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.4-x86_64.tar.gz
tar xzf alpine-rootfs.tar.gz -C rootfs
# Enter the chroot
sudo chroot rootfs /bin/sh
# Inside the chroot:
cat /etc/os-release
# You now see Alpine's userland on top of the host kernel; the host filesystem is out of view
ls / # Only Alpine's files
whoami # root (within the chroot)
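chroot fails with a confusing "No such file or directory" if the shell it is asked to run is missing from the new root, so a quick sanity check before entering can save time. This is a sketch only; the helper name rootfs_ready is made up for illustration.

```shell
# rootfs_ready DIR: succeed if DIR contains something at bin/sh.
# -L is checked as well because in some root filesystems bin/sh is a
# symlink with an absolute target (for example /bin/busybox) that only
# resolves once the process is inside the chroot.
rootfs_ready() {
  [ -e "$1/bin/sh" ] || [ -L "$1/bin/sh" ]
}

if rootfs_ready /tmp/mycontainer/rootfs; then
  echo "rootfs looks usable"
else
  echo "rootfs missing or incomplete: extract the tarball first"
fi
```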
Building a Container from Scratch
Let's combine all three primitives into an actual container.
#!/bin/bash
# build-container.sh — A container in ~20 lines of bash
ROOTFS="/tmp/mycontainer/rootfs"
CGROUP="/sys/fs/cgroup/scratch-container"
# Ensure rootfs exists (download Alpine if not present)
if [ ! -d "$ROOTFS/bin" ]; then
mkdir -p "$ROOTFS"
curl -sL https://dl-cdn.alpinelinux.org/alpine/v3.18/releases/x86_64/alpine-minirootfs-3.18.4-x86_64.tar.gz | tar xz -C "$ROOTFS"
fi
# Create cgroup with limits
sudo mkdir -p "$CGROUP"
echo $((128 * 1024 * 1024)) | sudo tee "$CGROUP/memory.max" # 128MB RAM
echo "25000 100000" | sudo tee "$CGROUP/cpu.max" # 25% CPU
echo 32 | sudo tee "$CGROUP/pids.max" # 32 processes
# Launch the container in new PID, mount, UTS, and IPC namespaces
sudo unshare \
--pid --fork \
--mount --uts --ipc \
--mount-proc="$ROOTFS/proc" \
bash -c "
# Set hostname
hostname scratch-container
# Mount essential filesystems
mount -t sysfs sysfs $ROOTFS/sys
mount -t tmpfs tmpfs $ROOTFS/tmp
# Add this process to the cgroup
echo \$\$ > $CGROUP/cgroup.procs
# Switch to the new root (chroot here; production runtimes use pivot_root)
exec chroot $ROOTFS /bin/sh
"
# Cleanup
sudo rmdir "$CGROUP" 2>/dev/null
# Run it
chmod +x build-container.sh
sudo ./build-container.sh
# Inside your container:
hostname # scratch-container
ps aux # Only your shell and ps — PID 1 is sh
cat /proc/cpuinfo # Shows all host CPUs, but the cgroup caps usage at 25% of one core
free -m # Shows host memory, but cgroup enforces 128MB
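Since /proc/cpuinfo and free still show host values, a more reliable way to confirm that the limits apply is to check which cgroup the process landed in. On cgroups v2, /proc/<pid>/cgroup contains a line starting with "0::" followed by the cgroup path. A minimal sketch follows; the helper name cgroup_v2_path is hypothetical.

```shell
# cgroup_v2_path FILE: print the cgroups v2 path from a
# /proc/<pid>/cgroup file (the line that starts with "0::").
cgroup_v2_path() {
  sed -n 's/^0:://p' "$1"
}

if [ -r /proc/self/cgroup ]; then
  cgroup_v2_path /proc/self/cgroup
fi
```

Inside the container built above, this should report /scratch-container, because the script does not create a separate cgroup namespace.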
Overlay Filesystems: Copy-on-Write Layers
Docker images use overlay filesystems to stack read-only layers with a writable top layer. This is how images share common base layers efficiently.
# Create the layer structure
mkdir -p /tmp/overlay/{lower,upper,work,merged}
# Lower layer: read-only base (imagine this is your base image)
echo "from base image" > /tmp/overlay/lower/base-file.txt
# Mount the overlay
sudo mount -t overlay overlay \
-o lowerdir=/tmp/overlay/lower,upperdir=/tmp/overlay/upper,workdir=/tmp/overlay/work \
/tmp/overlay/merged
# The merged view has the base file
cat /tmp/overlay/merged/base-file.txt
# Output: from base image
# Write a new file — goes to the upper layer only
echo "container data" > /tmp/overlay/merged/new-file.txt
# Verify: upper layer has the new file, lower is untouched
ls /tmp/overlay/upper/ # new-file.txt
ls /tmp/overlay/lower/ # base-file.txt (unchanged)
# Cleanup
sudo umount /tmp/overlay/merged
This is essentially what docker commit captures: the contents of the container's upper (writable) layer become a new image layer.
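Real images have more than one read-only layer. OverlayFS handles this with a colon-separated lowerdir list in which the leftmost directory is the top-most layer. The helper below is an illustrative sketch that builds the mount option string; the name overlay_opts and the app and base directories are assumptions for this example.

```shell
# overlay_opts UPPER WORK LOWER_TOP [LOWER_BELOW...]: build the -o string
# for "mount -t overlay". Lower directories are given top-most first.
overlay_opts() {
  local upper="$1" work="$2" lowers="" dir
  shift 2
  for dir in "$@"; do
    # Join lower directories with ':' (no leading colon on the first one)
    lowers="${lowers:+$lowers:}$dir"
  done
  echo "lowerdir=$lowers,upperdir=$upper,workdir=$work"
}

overlay_opts /tmp/overlay/upper /tmp/overlay/work /tmp/overlay/app /tmp/overlay/base
```

The result can be passed to mount, for example: sudo mount -t overlay overlay -o "$(overlay_opts ...)" /tmp/overlay/merged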
What Docker Actually Does
Now you can map every Docker concept to Linux primitives:
| Docker Concept | Linux Primitive |
|---|---|
| docker run --memory 256m | cgroups memory.max |
| docker run --cpus 0.5 | cgroups cpu.max |
| docker run --hostname foo | UTS namespace |
| docker run --network bridge | NET namespace + veth pair |
| docker run --pid host | Share host PID namespace |
| Image layers | OverlayFS lower directories |
| Container writable layer | OverlayFS upper directory |
| docker exec | nsenter into existing namespaces |
# See the namespaces of a running Docker container
docker inspect --format '{{.State.Pid}}' <container_id>
ls -la /proc/<pid>/ns/
# Enter a container's namespaces directly (roughly what 'docker exec' does)
sudo nsenter -t <pid> -m -u -i -n -p bash
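The links under /proc/<pid>/ns also make it possible to check whether two processes share a namespace, which is what an option like --pid host amounts to. A rough sketch: the helper name same_ns and the PROC_ROOT override are illustrative assumptions, and reading another user's process usually requires root.

```shell
# same_ns TYPE PID_A PID_B: succeed if both processes are in the same
# namespace of the given TYPE (pid, net, mnt, uts, ipc, user, cgroup, time).
# Returns 2 if either link cannot be read.
# PROC_ROOT is an illustrative override; it defaults to /proc.
same_ns() {
  local root="${PROC_ROOT:-/proc}" a b
  a="$(readlink "$root/$2/ns/$1")" || return 2
  b="$(readlink "$root/$3/ns/$1")" || return 2
  [ "$a" = "$b" ]
}

# Compare this shell with its parent
if same_ns uts "$$" "$PPID"; then
  echo "same UTS namespace as parent"
fi
```

Comparing a container's PID with PID 1 on the host for the net type should fail for a bridge-networked container and succeed for one started with --network host.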
Cleanup
# Remove the network namespace
sudo ip netns del mycontainer
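# Remove cgroup (move this shell back to the root cgroup first; rmdir fails while the cgroup still contains processes)
echo $$ | sudo tee /sys/fs/cgroup/cgroup.procs
sudo rmdir /sys/fs/cgroup/mycontainer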
# Remove rootfs
sudo rm -rf /tmp/mycontainer /tmp/overlay
Now that you understand the kernel primitives underneath containers, next we'll shift focus to securing the Linux host itself — a 20-step hardening checklist that every server exposed to the internet needs.
