Memory Mapping for Large Datasets: Efficient Data Loading from Object Storage
Part 4 of Memory & Compute Optimisation
Your training dataset is 15TB. Your machine has 512GB of RAM. The naive approach (load everything into memory) fails immediately. The slightly-less-naive approach (stream batches from disk) works until you profile it and discover your A100s are sitting at 12% utilization because they’re waiting for the next batch to arrive from your NVMe array, which is already maxed out at 7GB/s sequential read. You’ve optimized your model. You’ve optimized your data loader workers. You’ve tuned prefetch buffers and pinned memory. But the bottleneck is no longer PyTorch, it’s fundamental physics: you’re trying to push 15 terabytes through a 7 gigabyte-per-second pipe, epoch after epoch, and the math says your training run won’t finish for another 28 days.
We haven’t just scaled beyond memory, we’ve scaled beyond the epistemological assumptions of filesystems
This is where much of AI/ML engineering hits the wall that separates “making the model work” from “making the infrastructure work.” The model is fine. The code is fine. The problem is that modern deep learning has outgrown the assumptions embedded in traditional storage systems. POSIX I/O was designed in the 1970s for workloads where files fit in memory and sequential access was king. fopen, fread, & fclose were tools built for a world of bounded texts. Now we’re training on datasets that evoke Borges’ Library of Babel: not just vast, but structured in ways that defeat comprehension. In Borges’ library, the problem wasn’t merely that the books were infinite; it was that the catalog was useless, the organization arbitrary, and any path through the hexagonal galleries as valid and as meaningless as any other. Very proto-pomo. Our datasets present the same vertigo: sharded across object storage, replicated, compressed, with metadata that sprawls into its own storage problem. Which shard contains sample 10,000,000? Which replica is authoritative? The index needed to answer these questions grows until it rivals the datasets of a decade ago. We haven’t just scaled beyond memory, we’ve scaled beyond the epistemological assumptions of filesystems that presumed you could know what you had and where it lived.
In part four we cover:
The Basics (Why mmap matters)
When mmap Breaks (Thrashing)
NUMA and mmap (Hidden locality tax)
Prefetching and Parallelism (Because, obviously)
Object Storage Abstraction (WebDataset)
The evolution of data loading for deep learning is a chronicle of progressively abandoning abstractions that no longer serve us. First we abandoned in-memory loading (data too big). Then we abandoned naive file I/O (too slow). Then we abandoned local storage entirely (too expensive, too inflexible). What remains is a patchwork of memory mapping, object storage, and increasingly baroque workarounds for the fact that we’re trying to feed GPU clusters that process terabytes per hour from storage systems designed for workloads that process gigabytes per day. The gap between compute capability and storage bandwidth grows every year. A100s shipped in 2020. The storage APIs they use were designed in 1970. Let that sink in, as we say. But this impedance mismatch isn’t really a bug so much as the natural consequence of Moore’s Law applying to compute but not to I/O. Fr fr.
Memory mapping is the first tool you reach for when datasets exceed RAM but still fit on local storage. The idea is elegant: let the operating system manage the data movement. Map a file directly into your process’s address space. Access it like a normal array. The OS handles paging in chunks from disk as needed, caching hot pages in RAM, evicting cold pages when memory pressure builds. You get the illusion of infinite RAM backed by disk, and the kernel’s page cache is usually smarter than anything you’d implement yourself. This works beautifully for datasets that fit the kernel’s assumptions: files on local disk, access patterns with spatial locality, working sets that fit in available RAM. It fails catastrophically when those assumptions break: data on network storage, random access across the entire file, working sets that exceed memory by 10x or more. Understanding when memory mapping helps and when it hurts is the difference between a training run that scales and one that thrashes.
The Basics: mmap and Why It Matters
Memory mapping, via mmap() on Unix or MapViewOfFile() on Windows, establishes a correspondence between a region of a process’s virtual address space and a file on disk. When you read from that address range, the OS intercepts the access, loads the corresponding file block into RAM, and returns the data. When you write, the OS marks pages as dirty and flushes them back to disk asynchronously. No explicit read() calls. No buffer management. Just pointer arithmetic and memory loads, with the OS handling all the I/O in the background.
Here’s the simplest case:
import mmap
import numpy as np
import time
import os

def create_large_dataset(filename, size_gb=10):
    """
    Create a large binary dataset for testing.
    Using NumPy arrays saved to disk.
    """
    num_samples = int(size_gb * 1024**3 / (4 * 1024))  # Assuming 4KB per sample (1024 float32 values)
    samples_per_chunk = 10000
    print(f"Creating {size_gb}GB dataset: {num_samples} samples...")
    # Write in chunks to avoid OOM
    with open(filename, 'wb') as f:
        for i in range(0, num_samples, samples_per_chunk):
            chunk_size = min(samples_per_chunk, num_samples - i)
            data = np.random.randn(chunk_size, 1024).astype(np.float32)
            data.tofile(f)
            if i % 100000 == 0:
                print(f"  Written {i / num_samples * 100:.1f}%")
    print(f"Dataset created: {os.path.getsize(filename) / 1024**3:.2f} GB")

def benchmark_read_methods(filename, num_reads=1000):
    """
    Compare standard file I/O vs memory mapping for random access.
    """
    file_size = os.path.getsize(filename)
    sample_size = 1024 * 4  # 1024 float32 values = 4KB
    num_samples = file_size // sample_size
    print("=" * 80)
    print(f"Benchmark: {num_reads} random reads from {file_size / 1024**3:.2f} GB file")
    print("=" * 80)

    # Method 1: Standard file I/O with seek/read
    print("\nMethod 1: Standard file I/O (seek + read)")
    times_standard = []
    with open(filename, 'rb') as f:
        for _ in range(num_reads):
            # Random access
            offset = np.random.randint(0, num_samples) * sample_size
            start = time.perf_counter()
            f.seek(offset)
            data = f.read(sample_size)
            times_standard.append(time.perf_counter() - start)
    print(f"  Average time per read: {np.mean(times_standard)*1000:.3f} ms")
    print(f"  Total time: {sum(times_standard):.2f} s")

    # Method 2: Memory mapping
    print("\nMethod 2: Memory mapping (mmap)")
    times_mmap = []
    with open(filename, 'rb') as f:
        # Map the entire file into the process's address space
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        for _ in range(num_reads):
            offset = np.random.randint(0, num_samples) * sample_size
            start = time.perf_counter()
            data = mm[offset:offset + sample_size]
            times_mmap.append(time.perf_counter() - start)
        mm.close()
    print(f"  Average time per read: {np.mean(times_mmap)*1000:.3f} ms")
    print(f"  Total time: {sum(times_mmap):.2f} s")
    print(f"  Speedup: {np.mean(times_standard) / np.mean(times_mmap):.2f}x")

    # Method 3: Sequential reads (for comparison)
    print("\nMethod 3: Sequential reads (baseline)")
    times_sequential = []
    with open(filename, 'rb') as f:
        for _ in range(num_reads):
            start = time.perf_counter()
            data = f.read(sample_size)
            times_sequential.append(time.perf_counter() - start)
    print(f"  Average time per read: {np.mean(times_sequential)*1000:.3f} ms")
    print(f"  Total time: {sum(times_sequential):.2f} s")

# Create test dataset (only if it doesn't exist)
dataset_file = 'test_dataset.bin'
if not os.path.exists(dataset_file):
    create_large_dataset(dataset_file, size_gb=2)  # 2GB for testing

# Run benchmark
benchmark_read_methods(dataset_file, num_reads=1000)

Output (typical on NVMe SSD):
========================================================================
Benchmark: 1000 random reads from 2.00 GB file
========================================================================
Method 1: Standard file I/O (seek + read)
Average time per read: 0.142 ms
Total time: 0.14 s
Method 2: Memory mapping (mmap)
Average time per read: 0.008 ms
Total time: 0.01 s
Speedup: 17.75x
Method 3: Sequential reads (baseline)
Average time per read: 0.006 ms
Total time: 0.01 s

Memory mapping is ~18x faster for random access. The reason: the kernel’s page cache. After the first few accesses, hot pages stay in RAM. Subsequent accesses hit the cache (sub-microsecond latency) instead of hitting disk (millisecond latency). With standard file I/O, you pay a kernel crossing on every read, and the kernel can’t optimize across reads because it doesn’t see your access pattern. With mmap, the kernel sees page faults and can prefetch aggressively.
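In ML code you rarely call mmap() yourself; NumPy’s np.memmap wraps the same mechanism and gives you array indexing over the file. A minimal sketch, assuming the test_dataset.bin layout from above (contiguous 1024-value float32 samples):

import numpy as np

# np.memmap maps the file rather than loading it; pages are faulted in
# lazily as rows are indexed, and the kernel's page cache does the rest.
samples = np.memmap('test_dataset.bin', dtype=np.float32, mode='r').reshape(-1, 1024)

batch_idx = np.random.randint(0, len(samples), size=32)
batch = np.asarray(samples[batch_idx])  # copies only the touched rows into a regular array

print(batch.shape, batch.dtype)  # (32, 1024) float32

A PyTorch Dataset whose __getitem__ indexes into a memmapped array like this inherits all the page-cache behaviour described here, and all the thrashing caveats described next.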
But this is the best case: small dataset (2GB), plenty of RAM (probably 32GB+), random access with locality. The picture changes dramatically when you violate these assumptions.
When mmap Breaks: The Thrashing Problem
Memory mapping relies on the page cache having enough RAM to hold your working set. When the working set exceeds available RAM, the kernel starts evicting pages. If your access pattern has poor locality—you touch page A, then page Z, then page B, never revisiting A—the kernel evicts A before you need it again, causing a page fault when you do. Every page fault triggers disk I/O. You’ve turned memory access into disk access, but with extra overhead from page fault handling.
This is called thrashing, and it’s catastrophic for performance:
import mmap
import numpy as np
import time
import os
import psutil

def measure_memory_pressure(filename, access_pattern='sequential'):
    """
    Demonstrate thrashing when the working set exceeds available memory.
    """
    file_size = os.path.getsize(filename)
    sample_size = 1024 * 4  # 4KB
    num_samples = file_size // sample_size

    # Get available memory
    available_memory = psutil.virtual_memory().available
    print("=" * 80)
    print("Memory Pressure Test")
    print("=" * 80)
    print(f"File size: {file_size / 1024**3:.2f} GB")
    print(f"Available RAM: {available_memory / 1024**3:.2f} GB")
    print(f"Working set size: {file_size / 1024**3:.2f} GB")
    if file_size < available_memory * 0.5:
        print("WARNING: File fits comfortably in RAM, thrashing unlikely")

    with open(filename, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        # Access pattern: sequential
        if access_pattern == 'sequential':
            print("\nSequential access (good locality):")
            times = []
            start_total = time.perf_counter()
            for i in range(0, min(num_samples, 10000)):
                offset = i * sample_size
                start = time.perf_counter()
                data = mm[offset:offset + sample_size]
                times.append(time.perf_counter() - start)
            print(f"  Average latency: {np.mean(times)*1000:.3f} ms")
            print(f"  Throughput: {len(times) * sample_size / (time.perf_counter() - start_total) / 1024**2:.1f} MB/s")

        # Access pattern: random (poor locality)
        else:
            print("\nRandom access (poor locality):")
            times = []
            start_total = time.perf_counter()
            for _ in range(10000):
                offset = np.random.randint(0, num_samples) * sample_size
                start = time.perf_counter()
                data = mm[offset:offset + sample_size]
                times.append(time.perf_counter() - start)
            print(f"  Average latency: {np.mean(times)*1000:.3f} ms")
            print(f"  Throughput: {len(times) * sample_size / (time.perf_counter() - start_total) / 1024**2:.1f} MB/s")

        mm.close()

# Test both patterns
if os.path.exists(dataset_file):
    measure_memory_pressure(dataset_file, 'sequential')
    measure_memory_pressure(dataset_file, 'random')

Output (when file size > available RAM):
========================================================================
Memory Pressure Test
========================================================================
File size: 2.00 GB
Available RAM: 3.50 GB
Working set size: 2.00 GB
Sequential access (good locality):
Average latency: 0.009 ms
Throughput: 4234.2 MB/s
Random access (poor locality):
Average latency: 1.234 ms ← 137x slower!
Throughput: 31.2 MB/s

When the working set exceeds RAM and access is random, memory mapping becomes slower than explicit buffered I/O because the kernel’s page replacement policy works against you. You’re paying page fault overhead without getting cache benefits. At this point, you need a different strategy.
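One mitigation before giving up on mmap: tell the kernel what you’re about to do. On Linux, madvise() hints adjust readahead and eviction behaviour, and Python exposes them as mmap.madvise() (3.8+; the MADV_* constants are platform-specific). A minimal sketch against the same test file:

import mmap

with open('test_dataset.bin', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

    # Random access: disable readahead so the kernel stops prefetching
    # pages you will never touch.
    mm.madvise(mmap.MADV_RANDOM)
    # ... do random reads here ...

    # Sequential scan: ask for aggressive readahead instead.
    mm.madvise(mmap.MADV_SEQUENTIAL)

    # Hint that the first 64MB will be needed soon (asynchronous prefetch).
    mm.madvise(mmap.MADV_WILLNEED, 0, 64 * 1024 * 1024)

    mm.close()

These are hints, not guarantees: they move the thrashing threshold, they don’t remove it. When the working set is genuinely 10x RAM and access is uniformly random, no advice flag saves you.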
NUMA and mmap: The Hidden Locality Tax
On multi-socket servers (where most serious ML training happens), memory mapping has another nasty surprise: NUMA topology. Non-Uniform Memory Access means memory attached to CPU socket 0 is faster to access from socket 0 than from socket 1. When you mmap a file, the kernel allocates pages from whatever NUMA node happens to handle the page fault. If your data loader runs on CPUs on socket 0 but the pages get allocated in socket 1’s memory, every access pays a remote-memory penalty (typically 2-3x higher latency than local access).
For a 15TB dataset being fed to 8 GPUs across 2 sockets, NUMA misplacement can reduce effective bandwidth by 40-50%. The kernel has no way to know your data loader on socket 0 will be the primary accessor, so it doesn’t allocate optimally.
The solution is explicit NUMA binding:
import os
import mmap
import time
import numpy as np
import subprocess

def numa_aware_mmap(filename, numa_node=0):
    """
    Memory map a file so its pages land on a specific NUMA node.

    Pages are allocated on the node of the CPU that first touches them
    ("first touch"), so the practical approach is to pin this process to
    the target node's CPUs before faulting pages in. For strict binding,
    launch the whole process under `numactl --membind=<node>` instead.
    """
    try:
        # CPUs belonging to this NUMA node (Linux sysfs)
        with open(f'/sys/devices/system/node/node{numa_node}/cpulist') as f:
            cpulist = f.read().strip()
        cpus = set()
        for part in cpulist.split(','):
            lo, _, hi = part.partition('-')
            cpus.update(range(int(lo), int(hi or lo) + 1))
        os.sched_setaffinity(0, cpus)
        print(f"Pinned process to NUMA node {numa_node} ({len(cpus)} CPUs)")
    except (OSError, ValueError) as e:
        print(f"Warning: could not pin to NUMA node {numa_node}: {e}")

    # Now mmap - pages faulted in by this process will be allocated locally
    with open(filename, 'rb') as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    return mm

def benchmark_numa_impact(filename):
    """
    Measure impact of NUMA placement on mmap performance.
    Only meaningful on multi-socket systems.
    """
    # Check if system is NUMA
    try:
        result = subprocess.run(
            ['numactl', '--hardware'],
            capture_output=True,
            text=True
        )
        # First line looks like: "available: 2 nodes (0-1)"
        available = [line for line in result.stdout.split('\n') if 'available:' in line]
        num_nodes = int(available[0].split()[1]) if available else 1
        if num_nodes < 2:
            print("System is not NUMA, skipping NUMA benchmark")
            return
        print(f"Detected {num_nodes} NUMA nodes")
    except (OSError, ValueError, IndexError):
        print("numactl not available, skipping NUMA benchmark")
        return

    # Benchmark with different NUMA bindings
    print("\n" + "=" * 80)
    print("NUMA Impact on mmap Performance")
    print("=" * 80)
    sample_size = 1024 * 4
    num_reads = 10000
    for node in range(min(num_nodes, 2)):  # Test first 2 nodes
        print(f"\nNUMA node {node}:")
        mm = numa_aware_mmap(filename, numa_node=node)
        file_size = len(mm)
        num_samples = file_size // sample_size
        times = []
        for _ in range(num_reads):
            offset = np.random.randint(0, num_samples) * sample_size
            start = time.perf_counter()
            data = mm[offset:offset + sample_size]
            times.append(time.perf_counter() - start)
        mm.close()
        print(f"  Average latency: {np.mean(times)*1000:.3f} ms")
        print(f"  Bandwidth: {num_reads * sample_size / sum(times) / 1024**2:.1f} MB/s")

# Run NUMA benchmark if on suitable hardware
if os.path.exists(dataset_file):
    benchmark_numa_impact(dataset_file)

On a dual-socket server with local NVMe, NUMA-aware mapping can improve bandwidth by roughly 2x. On a system where data comes from network storage (NAS, SAN), NUMA matters less because the network is the bottleneck, not memory access.
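To verify where your mapped pages actually ended up, Linux reports per-node page counts for every mapping in /proc/<pid>/numa_maps. A rough diagnostic sketch (the helper name and the N<node>=<pages> parsing are mine; the file format is Linux-specific):

import os
from collections import Counter

def pages_per_node(path_substring):
    """Sum the N<node>=<pages> counters for mappings matching path_substring."""
    counts = Counter()
    with open(f'/proc/{os.getpid()}/numa_maps') as f:
        for line in f:
            if path_substring not in line:
                continue
            for token in line.split():
                if token.startswith('N') and '=' in token:
                    node, _, pages = token[1:].partition('=')
                    if node.isdigit() and pages.isdigit():
                        counts[int(node)] += int(pages)
    return counts

# After mapping the dataset and touching some pages:
# print(pages_per_node('test_dataset.bin'))  # e.g. Counter({0: <pages on node 0>, 1: <pages on node 1>})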
Object Storage: When mmap Stops Working Entirely
Memory mapping assumes your data lives on a filesystem that supports mmap: local disk, NFS (with caveats), maybe Lustre or GPFS in HPC environments. It does not work with object storage. You cannot mmap an S3 bucket. There’s no file descriptor to pass to mmap(), no POSIX filesystem semantics, no page faults that the kernel can intercept and turn into S3 GET requests.
This is where most ML infrastructure hits the next wall. You’ve outgrown local storage (too expensive, not flexible enough). You’ve moved to S3 or GCS (cheap, durable, accessible from anywhere). Now your training data lives in what amounts to Borges’ infinite library distributed across data centers: each sample exists somewhere in the vast hexagonal galleries of S3’s address space, accessible by key but not by position, readable but not mappable, present but never local. The library’s peculiar property, that every possible book exists somewhere within it, mirrors S3’s promise: every possible byte sequence exists, if only you knew its key. But unlike Borges’ librarians, who could at least walk the hexagonal galleries, you must request each sample via HTTP, paying latency overhead for the privilege of remote access to data that nominally “belongs” to you.
The naive approach is boto3 or requests:
import boto3
import io
import time
import numpy as np

def naive_s3_loading(bucket, key, num_samples=100):
    """
    Naive S3 data loading: fetch entire objects on demand.
    This is what everyone writes first. It's also catastrophically slow.
    """
    s3 = boto3.client('s3')
    print("=" * 80)
    print("Naive S3 Loading")
    print("=" * 80)
    times = []
    bytes_transferred = 0
    for i in range(num_samples):
        start = time.perf_counter()
        # Fetch object
        response = s3.get_object(Bucket=bucket, Key=f"{key}/sample_{i}.npy")
        data = response['Body'].read()
        # Deserialize
        arr = np.load(io.BytesIO(data))
        elapsed = time.perf_counter() - start
        times.append(elapsed)
        bytes_transferred += len(data)
    print(f"Samples loaded: {num_samples}")
    print(f"Average time per sample: {np.mean(times)*1000:.1f} ms")
    print(f"Total time: {sum(times):.2f} s")
    print(f"Bandwidth: {bytes_transferred / sum(times) / 1024**2:.1f} MB/s")
    print("Latency breakdown:")
    print(f"  Min: {min(times)*1000:.1f} ms")
    print(f"  Median: {np.median(times)*1000:.1f} ms")
    print(f"  P95: {np.percentile(times, 95)*1000:.1f} ms")
    print(f"  Max: {max(times)*1000:.1f} ms")
Output (typical):
========================================================================
Naive S3 Loading
========================================================================
Samples loaded: 100
Average time per sample: 45.3 ms
Total time: 4.53 s
Bandwidth: 88.2 MB/s
Latency breakdown:
Min: 23.1 ms
Median: 42.7 ms
P95: 78.4 ms
Max: 156.2 ms
45ms per sample. For a training run that processes 1 million samples per epoch, that’s 12.5 hours just loading data. Your GPUs sit idle for 12 hours waiting for the next batch. This is unacceptable.
The problem is latency. Each get_object() call is an HTTP request with:
DNS lookup (cached after first, but still)
TCP handshake (3-way, ~5ms baseline)
TLS handshake (2 round trips, ~10ms)
HTTP request/response (1 round trip, ~5ms)
Data transfer (depends on size)
Even for small objects (4KB), you’re paying 20-30ms of latency overhead before any bytes move. At 45ms per sample, you can load maybe 20 samples/second. A single A100 can process 1000+ samples/second. The GPU is starving.
Prefetching and Parallelism: The Obvious Solution
The obvious fix: load samples in parallel. While the GPU processes batch N, prefetch batch N+1, N+2, N+3 in background threads.
import boto3
import concurrent.futures
import threading
import time
import io
import numpy as np

class S3DataLoader:
    """
    S3 data loader with prefetching and parallelism.
    Much better than naive loading, still not optimal.
    """
    def __init__(self, bucket, prefix, num_samples, prefetch_factor=16, num_workers=8):
        self.bucket = bucket
        self.prefix = prefix
        self.num_samples = num_samples
        self.prefetch_factor = prefetch_factor
        # Thread pool for parallel fetching
        self.executor = concurrent.futures.ThreadPoolExecutor(max_workers=num_workers)
        # In-flight fetches, keyed by sample index
        self.futures = {}
        self.next_idx = 0
        # One S3 client per thread (safest not to share clients across threads)
        self.local = threading.local()

    def _get_s3_client(self):
        """Get a thread-local S3 client."""
        if not hasattr(self.local, 's3'):
            self.local.s3 = boto3.client('s3')
        return self.local.s3

    def _fetch_sample(self, sample_idx):
        """Fetch a single sample from S3."""
        s3 = self._get_s3_client()
        key = f"{self.prefix}/sample_{sample_idx}.npy"
        try:
            response = s3.get_object(Bucket=self.bucket, Key=key)
            data = response['Body'].read()
            return np.load(io.BytesIO(data))
        except Exception as e:
            print(f"Error fetching {key}: {e}")
            return None

    def __iter__(self):
        return self

    def __next__(self):
        """
        Return the next sample in order, keeping up to prefetch_factor
        future fetches in flight in the background.
        """
        if self.next_idx >= self.num_samples:
            raise StopIteration
        # Keep the pipeline full: submit fetches for upcoming indices
        last = min(self.next_idx + self.prefetch_factor, self.num_samples - 1)
        for idx in range(self.next_idx, last + 1):
            if idx not in self.futures:
                self.futures[idx] = self.executor.submit(self._fetch_sample, idx)
        # Block only on the sample we actually need right now
        result = self.futures.pop(self.next_idx).result()
        self.next_idx += 1
        return result

def benchmark_prefetching(bucket, prefix, num_samples=100):
    """
    Compare naive vs prefetched S3 loading.
    """
    print("\n" + "=" * 80)
    print("S3 Prefetching Benchmark")
    print("=" * 80)

    # Naive (for comparison)
    print("\nNaive loading (sequential, no prefetch):")
    s3 = boto3.client('s3')
    start = time.perf_counter()
    for i in range(num_samples):
        key = f"{prefix}/sample_{i}.npy"
        response = s3.get_object(Bucket=bucket, Key=key)
        data = response['Body'].read()
        arr = np.load(io.BytesIO(data))
    naive_time = time.perf_counter() - start
    print(f"  Time for {num_samples} samples: {naive_time:.2f} s")
    print(f"  Samples/second: {num_samples / naive_time:.1f}")

    # Prefetched
    print("\nPrefetched loading (parallel, prefetch_factor=16):")
    loader = S3DataLoader(bucket, prefix, num_samples, prefetch_factor=16, num_workers=8)
    start = time.perf_counter()
    for sample in loader:
        pass
    prefetch_time = time.perf_counter() - start
    print(f"  Time for {num_samples} samples: {prefetch_time:.2f} s")
    print(f"  Samples/second: {num_samples / prefetch_time:.1f}")
    print(f"  Speedup: {naive_time / prefetch_time:.2f}x")

# Example usage (requires an actual S3 bucket)
# benchmark_prefetching('my-training-data', 'datasets/imagenet')

With 8 parallel workers and aggressive prefetching, you can achieve a 5-10x speedup over naive loading. But you’re still bound by S3’s rate limits (5,500 GET requests per second per prefix), and you’re still paying latency overhead on every request.
The next level is: don’t fetch individual samples. Fetch large chunks, decompress in memory, and serve samples from the decompressed buffer. This is what TensorFlow’s tf.data and the WebDataset library in the PyTorch ecosystem do.
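A back-of-the-envelope sketch of that idea before reaching for a library: fetch one large byte range per request and slice many fixed-size samples out of the in-memory buffer. The bucket/key names and the flat 4KB-per-sample layout are assumptions carried over from the earlier examples:

import boto3
import numpy as np

def load_chunk_of_samples(bucket, key, start_sample, num_samples, sample_size=4096):
    """Fetch num_samples contiguous samples with a single ranged GET."""
    s3 = boto3.client('s3')
    first_byte = start_sample * sample_size
    last_byte = first_byte + num_samples * sample_size - 1
    # One HTTP request, amortized over thousands of samples
    response = s3.get_object(
        Bucket=bucket,
        Key=key,
        Range=f"bytes={first_byte}-{last_byte}",
    )
    buf = response['Body'].read()
    # Samples are plain float32 rows of 1024 values in this layout
    return np.frombuffer(buf, dtype=np.float32).reshape(num_samples, 1024)

# e.g. one ~40MB request instead of 10,000 x 4KB requests:
# pool = load_chunk_of_samples('my-training-data', 'datasets/train.bin', 0, 10_000)

One 40MB request replaces ten thousand 4KB requests; the latency tax is paid once per chunk instead of once per sample.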
WebDataset: The Right Abstraction for Object Storage
WebDataset, created by Thomas Breuel, is the library that finally gets S3 loading right. The insight: store data as TAR archives (thousands of samples per file), stream the TAR from S3, decompress on the fly, and serve samples without materializing the entire archive. You fetch a 1GB TAR file once (1 HTTP request), then serve 10,000 samples from it (0 additional requests).
import webdataset as wds
import torch
from torch.utils.data import DataLoader

def create_webdataset_pipeline(urls, batch_size=32):
    """
    WebDataset pipeline for streaming from S3/HTTP.
    This is the production-grade approach.
    """
    dataset = (
        wds.WebDataset(urls)
        # Shuffle samples through a 1000-sample in-memory buffer
        .shuffle(1000)
        # Decode samples (handles .npy, .jpg, .json, etc)
        .decode('numpy')
        # Extract the fields we care about
        .to_tuple('input.npy', 'target.npy')
        # Convert to torch tensors
        .map_tuple(torch.from_numpy, torch.from_numpy)
        # Batch
        .batched(batch_size)
    )
    return dataset

# Example: streaming from S3
# urls = [
#     's3://my-bucket/train-shards/shard-000000.tar',
#     's3://my-bucket/train-shards/shard-000001.tar',
#     # ... thousands more
# ]
#
# dataset = create_webdataset_pipeline(urls, batch_size=32)
# loader = DataLoader(dataset, num_workers=4, batch_size=None)
#
# for batch in loader:
#     inputs, targets = batch
#     # Train model

WebDataset achieves 500-1000 MB/s from S3 with proper tuning, vs 50-100 MB/s with naive loading. The key optimisations:
Large shards - each TAR is 500MB-2GB, amortizing request latency
Streaming decompression - decompress on the fly, never materialize full TAR
Shuffled streaming - samples pass through an in-memory shuffle buffer as shards stream in, with optional shard-level shuffling of the URL list on top
Prefetch shards - while serving from shard N, download shard N+1 in background
This is what you actually deploy for production training on cloud object storage. Not mmap. Not boto3. WebDataset or equivalent (PyTorch’s TorchData with FSSpec, TensorFlow’s tf.data with GCS connector).
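The other half of the workflow is producing those shards. A minimal sketch using only the standard library (WebDataset also ships writer helpers; the input.npy/target.npy field names match the pipeline above, and the shard sizing is illustrative):

import io
import tarfile
import numpy as np

def write_shard(shard_path, samples):
    """Pack (input, target) pairs into one TAR shard.

    Each sample becomes two members sharing a key prefix,
    e.g. 000000.input.npy and 000000.target.npy.
    """
    with tarfile.open(shard_path, 'w') as tar:
        for idx, (x, y) in enumerate(samples):
            for name, arr in (('input', x), ('target', y)):
                buf = io.BytesIO()
                np.save(buf, arr)
                buf.seek(0)
                info = tarfile.TarInfo(name=f"{idx:06d}.{name}.npy")
                info.size = buf.getbuffer().nbytes
                tar.addfile(info, buf)

# In practice you'd write ~10,000 samples per shard so each TAR lands in the 500MB-2GB range
samples = [(np.random.randn(1024).astype(np.float32),
            np.random.randn(1).astype(np.float32)) for _ in range(100)]
write_shard('shard-000000.tar', samples)

The key prefix before the first dot (000000) is what WebDataset uses to group the two files back into a single sample on the read side.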
The Deeper Point
The progression from in-memory loading → file I/O → mmap → object storage reveals something fundamental about the relationship between abstraction and performance. Every abstraction layer (POSIX files, virtual memory, HTTP) was designed for assumptions that no longer hold at scale. Files were designed when datasets fit in memory. mmap was designed when working sets fit in RAM. HTTP was designed for documents, not terabyte-scale streaming data.
Each time scale increases by an order of magnitude, we discover that the abstraction we’re relying on has embedded assumptions that break. We then build a new abstraction (WebDataset, TorchData) that works at the current scale, until we hit the next wall and discover those assumptions break too. The cycle repeats. There’s no “right” abstraction for data loading because the right abstraction depends on the scale you’re operating at, and that scale keeps increasing.
This is why systems engineering is never finished. We’re not refining a fixed solution to a fixed problem. We’re chasing a moving target where every solution creates new bottlenecks at the next order of magnitude. Memory mapping works great at 10GB scale, becomes marginal at 1TB scale, and completely breaks at 100TB scale. The algorithm isn’t changing, but the data keeps growing. And as data size grows faster than storage bandwidth improves, the abstraction that works today will be obsolete in just a few quarters, when datasets are 10x larger but storage is only 2x faster.
𒅃 The POSIX file API, designed at Bell Labs in the 1970s, assumed that files could be opened, read sequentially, and closed. This was a reasonable assumption when a “large” file was 10 megabytes. By 2025, training datasets routinely exceed 10 terabytes, and the assumption that you can open a file has become quaint. You cannot open a 10TB file. You can barely stat it. The API doesn’t fail gracefully, it just becomes unusably slow, and we discover that abstractions designed for one scale don’t just perform poorly at another scale, they fail categorically, forcing us to rebuild the entire I/O stack from first principles. Memory mapping was meant to solve the problem of files larger than RAM. Object storage was meant to solve the problem of datasets larger than local disk. Each solution worked until scale increased by another order of magnitude, at which point we discovered we’d merely postponed the reckoning. The pattern is clear: abstraction layers are bets on scale staying within certain bounds, and those bets keep failing. Borges’ librarians never found the catalog because the catalog was itself lost in the infinite hexagonal galleries. We keep building bigger indexes, and the indexes keep becoming datasets that need their own indexes. The question then isn’t whether current solutions will break at the next order of magnitude. The question is when.
Next up: CUDA Kernels for Transformers
Where we discover that PyTorch’s default implementations are leaving 3-5x performance on the table, and writing custom CUDA is no longer optional for production systems. Also, why kernel fusion matters more than algorithmic complexity, and how to correctly profile GPU kernels when nvprof lies.
← prev: Activation Recomputation | next: CUDA Kernels →

