Concepts

Git for Data — DVC, lakeFS, and Large Data Versioning

Why plain Git struggles with datasets and model weights, and how DVC, lakeFS, git-annex, and Git LFS each solve data versioning for ML and analytics workloads.

Written by Lance MQ · 12 years of Git

Who This Is For

Readers who want the history model before advanced commands

Prerequisites

A basic sense that commits are not just a file list

Common Risks

Treating a concepts page like a command how-to

Data & Performance

O(n) growth: each committed revision of a binary dataset adds a full new object; Git's delta packing rarely helps with compressed binaries. Source: Pro Git §10.4 Packfiles

Key Quotes

Git stores every version of every file as a separate object; for large binary data that changes often, the repository grows without bound because Git cannot delta-compress most binary formats efficiently.
— Pro Git, 2nd Ed., §10.4 Packfiles

Citations & Further Reading

Pro Git §10.4 — Packfiles [Book]
DVC — Data Version Control docs [Official]
lakeFS documentation [Official]
Git LFS [Official]

What you will learn

Why plain Git bloats and slows down on datasets, model weights, and parquet files
The four main approaches to data versioning and when each fits
How DVC tracks data with pointer files while Git tracks the pointers
How lakeFS gives you Git-like branching over object storage without cloning data

Start with a problem

Your team trains an ML model. The training dataset is 8 GB of parquet. Every experiment re-saves a slightly different 8 GB file. After a month, your .git directory is 60 GB, git clone takes 20 minutes, and nobody can tell which dataset produced the model in production. Git's snapshot model — its greatest strength for source code — becomes its worst liability for large, frequently-changing binaries.

Why plain Git struggles with data

Git is a content-addressable snapshot store. For source code this is ideal: text diffs are small, delta compression is effective, and every version is cheap. For data it breaks down:

Binaries don't delta. Git's packfiles save space by storing objects as deltas against a similar base object. Compressed formats (parquet, PNG, compressed model checkpoints) are already close to maximum entropy, and a small logical change tends to alter most of the file's bytes, so each revision is stored nearly in full.
History never shrinks. A deleted 8 GB file stays in the object database forever unless you rewrite history — which is painful on shared branches.
Clone cost scales with history. A default (full) clone fetches every reachable version of every blob.

The result: repo size grows with O(revisions × binary_size), which is unbounded for active data pipelines.
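As a rough worked example of that growth, assuming every revision is stored in full with no useful delta, the worst case is plain multiplication. The helper name and the sample revision counts below are illustrative only and do not come from any tool:

```shell
# Worst-case estimate: every revision of the binary is stored in full.
# estimate_repo_gb is a hypothetical helper, not a Git command.
estimate_repo_gb() {
    revisions=$1
    size_gb=$2
    echo $(( revisions * size_gb ))
}

# The 8 GB parquet file from the opening scenario, re-saved N times.
for n in 1 7 30; do
    echo "$n revisions -> $(estimate_repo_gb "$n" 8) GB of blobs"
done
```

Seven full revisions already account for 56 GB, which is close to the 60 GB .git directory in the opening scenario.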

The four approaches

Approach	Data lives in	Git stores	Best for
Git LFS	LFS remote (HTTP)	A small LFS pointer file	Moderately large files (media, binaries) in a normal repo
DVC	Any cloud/object store	A small `.dvc` pointer file containing the data hash	ML datasets + model artifacts tracked alongside code
git-annex	Any remote (git-annex special remotes)	Symlink + key	Larger scientific data, file-level dedup
lakeFS	Object storage (S3 etc.) directly	Branch metadata on the lake	Analytics/data-engineering teams who branch data in place

Git LFS

Git LFS replaces large files in the repo with tiny pointer files and stores the real bytes on a separate LFS server. Good when your large files are static-ish (build artifacts, design assets). It struggles when data changes every run, because every version is still stored (just on the LFS server), and LFS quotas get expensive fast.
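To make "tiny pointer file" concrete, here is a sketch that prints the three-line text a Git LFS pointer contains for a given file (spec version, SHA-256 object id, size in bytes). It is an illustration only: in real use the git-lfs filter writes the pointer for you. The function name is hypothetical, and the sketch assumes GNU coreutils (sha256sum).

```shell
# Print the text a Git LFS pointer would contain for a given file.
# Illustration only: the real pointer is written by the git-lfs filter, not by hand.
# lfs_pointer_preview is a hypothetical name; assumes sha256sum is available.
lfs_pointer_preview() {
    file=$1
    oid=$(sha256sum "$file" | cut -d' ' -f1)
    size=$(( $(wc -c < "$file") ))   # arithmetic expansion strips any padding from wc
    printf 'version https://git-lfs.github.com/spec/v1\n'
    printf 'oid sha256:%s\n' "$oid"
    printf 'size %s\n' "$size"
}

# Usage: a 5-byte sample file stands in for a large asset.
sample=$(mktemp)
printf 'hello' > "$sample"
lfs_pointer_preview "$sample"
```

However large the real file is, the pointer stays at three short lines, which is why the Git repo itself stays small while the LFS server's storage keeps growing with each revision.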

DVC — the ML-native choice

DVC keeps data out of Git entirely. Git tracks a small .dvc pointer file containing a content hash; the actual data lives in a configurable remote (S3, GCS, Azure, SSH). This means:

The Git repo stays tiny — only pointer files are committed.
Data is reproducible: a commit + its .dvc pointers uniquely identify the exact dataset.
dvc push / dvc pull sync data to/from the remote on demand.

# Track a dataset with DVC
dvc init                       # one-time, creates .dvc/
dvc add data/train.parquet     # creates data/train.parquet.dvc and data/.gitignore
git add data/train.parquet.dvc data/.gitignore
git commit -m "data: add train.parquet v1"
dvc push                       # uploads the bytes; assumes a default remote is configured

# Later: swap in a new version
dvc add data/train.parquet     # pointer hash updates
git commit -am "data: refresh train.parquet for run 42"
dvc push                       # upload the new version too

# Anyone can reproduce your exact dataset
git checkout <experiment-commit>
dvc pull                       # fetches the matching data version

The key insight: Git versions the pointer; DVC versions the bytes. Code and data history stay in sync because every data change is also a commit to the pointer file, but the bytes never enter .git.
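The pointer idea can be sketched with a toy content-addressed store. This is not how DVC is implemented; it only mirrors the principle that a small file holding a hash is enough to find the exact bytes again. The function names, the ".ptr" suffix, and the store layout are all hypothetical, and the sketch assumes GNU coreutils (md5sum).

```shell
# Toy content-addressed store: NOT DVC's implementation, just the same idea.
# toy_add / toy_checkout and the ".ptr" suffix are hypothetical names.
toy_add() {            # toy_add <file> <store-dir>
    hash=$(md5sum "$1" | cut -d' ' -f1)
    mkdir -p "$2"
    cp "$1" "$2/$hash"             # the bytes go to the store, keyed by hash
    echo "$hash" > "$1.ptr"        # the tiny pointer is what Git would commit
}

toy_checkout() {       # toy_checkout <pointer-file> <store-dir> <dest>
    cp "$2/$(cat "$1")" "$3"       # the pointer alone is enough to restore the bytes
}

work=$(mktemp -d)
echo "id,label" > "$work/train.csv"
toy_add "$work/train.csv" "$work/store"
cp "$work/train.csv.ptr" "$work/v1.ptr"      # stands in for the v1 commit

echo "1,cat" >> "$work/train.csv"            # the dataset changes
toy_add "$work/train.csv" "$work/store"      # new hash, new stored object

# Restore the first version using only its pointer.
toy_checkout "$work/v1.ptr" "$work/store" "$work/restored.csv"
cat "$work/restored.csv"
```

The store ends up holding both versions, while each "commit" only ever carried a one-line pointer. Losing the pointer means losing the only link to the right object.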

lakeFS — branching data in place

lakeFS applies Git's branching model directly on object storage. Instead of copying data, a lakeFS branch is a metadata pointer over the same underlying objects. Using the lakectl CLI or the API, you can branch, commit, merge, and revert terabytes of data without copying the underlying objects. This fits data-engineering teams who already live in S3 and want isolated, reviewable data changes without ETL copies.
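A toy model shows why a metadata-only branch costs almost nothing. This is not lakeFS's internal format; the directory layout, manifest format, and object names are hypothetical, chosen only to show a branch as a small list of path-to-object references.

```shell
# Toy model of a zero-copy branch: NOT lakeFS internals, only the general idea.
# A "branch" here is a small manifest file listing "path object-id" pairs.
lake=$(mktemp -d)
mkdir -p "$lake/objects" "$lake/branches"

# Two stored objects, referenced by the main branch's manifest.
echo "rows for january"  > "$lake/objects/obj-001"
echo "rows for february" > "$lake/objects/obj-002"
printf 'sales/jan.parquet obj-001\nsales/feb.parquet obj-002\n' > "$lake/branches/main"

# Branching copies the manifest (a few bytes of metadata), never the objects.
cp "$lake/branches/main" "$lake/branches/experiment"

# A change on the branch writes one new object and repoints one path.
echo "rows for february, corrected" > "$lake/objects/obj-003"
sed 's|^sales/feb.parquet .*|sales/feb.parquet obj-003|' \
    "$lake/branches/experiment" > "$lake/branches/experiment.tmp"
mv "$lake/branches/experiment.tmp" "$lake/branches/experiment"

# main is untouched; obj-001 is shared by both branches.
cat "$lake/branches/main"
cat "$lake/branches/experiment"
```

Because both manifests reference obj-001, deleting that object from storage would break both branches at once, which is why shared objects need care.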

git-annex

git-annex is the older, lower-level option: it identifies each file's content by a key, stores that content in ordinary or special remotes, records which remotes hold which keys, and exposes files as symlinks in the working tree. Powerful and flexible, but with a steeper learning curve than DVC for most teams.

Choosing

Source code + occasional large static files → Git LFS.
ML: datasets + model artifacts versioned with code → DVC.
Data engineering: branch/merge large lakes in S3 → lakeFS.
Scientific computing, fine-grained dedup, many remotes → git-annex.

A common real pattern: Git for code, DVC for datasets and model weights, and a model registry on top. The repo stays small, experiments are reproducible, and git log still tells the story.

Common mistakes

Committing raw datasets into Git. The repo balloons and never recovers without git filter-repo.
Using Git LFS for data that changes every run. LFS storage costs scale with revisions just like Git.
Forgetting to commit the .dvc pointer. Without it, dvc pull can't find the right version — the pointer is the whole point.
Treating lakeFS branches as independent copies. Branches are cheap precisely because they're metadata; the underlying objects are shared, so destructive operations on the storage still need care.

Try it yourself

In a throwaway repo, dvc init, add a small CSV with dvc add, commit the .dvc pointer, then modify the CSV and re-add — observe how Git only sees the pointer change.
Run git count-objects -vH on a repo before and after a large binary commit to feel the bloat firsthand.
Set up a two-commit DVC history, git checkout the older commit, and dvc pull to reproduce the older dataset exactly.
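For the git count-objects exercise, a minimal throwaway script might look like the following. It assumes git is installed; the 1 MiB random blob, the file name, and the demo identity are arbitrary choices for this sketch, and random bytes stand in for an incompressible dataset.

```shell
# Throwaway demo: commit an incompressible blob twice and compare pack sizes.
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Identity and signing are set per command so the demo ignores global config.
g() { git -c user.name=demo -c user.email=demo@example.com -c commit.gpgsign=false "$@"; }

head -c 1048576 /dev/urandom > blob.bin      # 1 MiB of random bytes
g add blob.bin
g commit -q -m "blob v1"
git gc -q                                    # pack everything so sizes are comparable
size_v1=$(git count-objects -v | awk '/^size-pack:/ {print $2}')

head -c 1048576 /dev/urandom > blob.bin      # a completely different "revision"
g commit -q -am "blob v2"
git gc -q
size_v2=$(git count-objects -v | awk '/^size-pack:/ {print $2}')

echo "pack after v1: ${size_v1} KiB, after v2: ${size_v2} KiB"
```

The second figure should come out at roughly double the first, since Git cannot express one random blob as a delta of the other.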
