ClickHouse Backup, the hard way
Situation
I was working on a Docker-based log analysis platform where ClickHouse stores the production logs. The system already included Django, Nginx, ClickHouse, and clickhouse-backup, but backup operations were still too close to infrastructure commands: users needed a practical way to create, inspect, download, upload, restore, and delete backups from the web interface.
What looked like a small CRUD feature became an operations problem. Backups can be large, uploads pass through several layers, and restore is destructive. A frozen request or a vague “restore failed” message was not acceptable for production log tables.
The final feature creates all-table, selected-table, or schema-only backups; lists metadata and actions; builds downloadable archives; uploads and restores .tar.gz files; reports progress; supports cancellation and deletion; and presents the workflow in English and Persian.
Problem
The first implementation proved the basic idea, but also exposed most of the dangerous edge cases.
Archive downloads were originally created synchronously with a tar subprocess. Large backups held HTTP requests open, duplicate clicks could duplicate work, and partial archives could look complete. Upload progress could also reach 100% while Django was still copying the file.
The first restore-from-file prototype trusted direct tar extraction, introducing path-traversal and link risks. It required a shadow/ directory, which broke schema-only backups, and early versions extracted files without completing the ClickHouse restore.
The bigger issue was restore semantics. Calling perform_restore() directly against production could replace schema objects before the new data was known to be usable. Stopping at the wrong moment could leave production half-restored.
There were smaller integration bugs too:
- Internal directories such as .restore_status, .restore_cancel, and .import-* could leak into the backup list.
- The create API did not always return the exact status code or JSON shape the UI expected.
- “No tables selected” had historical schema-only behavior that needed to be made explicit.
- Missing runtime directories caused first-run upload or restore failures.
- Long uploads needed matching Nginx body-size, timeout, and buffering settings.
- Session expiry could turn an AJAX restore request into an HTML login redirect.
- Progress files, locks, temporary archives, and cancel flags could become stale after interruption.
Solution
I kept clickhouse-backup as the backup engine and built an orchestration layer around it.
The API wrapper in backup.utils handles the infrastructure calls:
- fetch_backups(), fetch_actions(), and fetch_tables() parse the API’s newline-delimited JSON.
- fetch_details() reads and formats metadata.json.
- perform_create() translates the form into name, table, and schema parameters while accepting any successful 2xx response.
- perform_restore() supports database/table mappings, selected tables, schema/data modes, and an optional rm flag.
- perform_delete() removes the local backup through the API.
- is_internal_backup_name() hides UI runtime artifacts from users.
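To make the wrapper's parsing and filtering concrete, here is a minimal sketch rather than the project's actual code. The helper names parse_ndjson() and visible_backups(), the "name" field, and the exact set of internal names are assumptions for illustration; only the artifact names .restore_status, .restore_cancel, and .import-* come from the text above.

```python
import json

# Assumed list of UI runtime artifacts, taken from the names mentioned in the article.
INTERNAL_NAMES = {".restore_status", ".restore_cancel"}
INTERNAL_PREFIXES = (".import-",)


def is_internal_backup_name(name: str) -> bool:
    """Return True for directories the UI creates for its own bookkeeping."""
    return name in INTERNAL_NAMES or name.startswith(INTERNAL_PREFIXES)


def parse_ndjson(body: str) -> list:
    """Parse newline-delimited JSON, skipping blank or malformed lines."""
    rows = []
    for line in body.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            continue  # one bad line should not hide every other backup
        if isinstance(row, dict):
            rows.append(row)
    return rows


def visible_backups(body: str) -> list:
    """Drop internal artifacts before the list reaches the template."""
    return [
        row for row in parse_ndjson(body)
        if not is_internal_backup_name(str(row.get("name", "")))
    ]
```

Skipping malformed lines is a design choice in this sketch; a stricter wrapper could raise instead and surface the parsing error in the UI.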
The Django views add orchestration. Operation IDs make logs traceable. sanitize_archive_name(), sanitize_upload_id(), and sanitize_restore_id() constrain user-controlled names. _ensure_backup_runtime_dirs() handles clean installations, while _write_json_atomic() prevents polling clients from reading half-written JSON.
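The atomic-write and sanitizing helpers can be sketched with the standard library alone. This is an illustration under assumptions, not the code behind _write_json_atomic() or sanitize_restore_id(): the function names, the 64-character limit, and the allowed character set are hypothetical.

```python
import json
import os
import re
import tempfile


def write_json_atomic(path: str, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place.

    os.replace() is atomic on the same filesystem, so a polling reader sees
    either the old file or the new one, never a half-written document.
    """
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)  # covers clean installations
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


# Hypothetical rule: identifiers are short and contain no path separators or dots.
_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")


def sanitize_id(value: str) -> str:
    """Reject anything that could escape the status directory."""
    if not _ID_PATTERN.fullmatch(value):
        raise ValueError(f"invalid identifier: {value!r}")
    return value
```

Rejecting bad identifiers outright, instead of silently rewriting them, keeps the ID in the logs identical to the ID the browser sent.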
For downloads, backup_download() became a state machine. It streams files into .tar.gz.part, publishes throttled progress, and atomically renames the result. An exclusive .lock prevents duplicate workers; .cancel supports cancellation; .progress exposes status; and work artifacts older than ten hours are treated as stale and removed. Completed archives are reused.
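The lock, .part, and .cancel conventions can be shown in a simplified, synchronous sketch. The real worker runs in a daemon thread and also writes .progress, both omitted here; build_archive(), try_acquire_lock(), and ArchiveCancelled are hypothetical names, not the project's own.

```python
import os
import tarfile


class ArchiveCancelled(Exception):
    """Raised inside the worker when a .cancel flag appears."""


def try_acquire_lock(lock_path: str) -> bool:
    """Create the lock file exclusively; False means another worker owns it."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def build_archive(backup_dir: str, archive_path: str) -> str:
    """Return 'ready', 'in_progress', or 'cancelled' for one archive request."""
    lock = archive_path + ".lock"
    part = archive_path + ".part"
    cancel = archive_path + ".cancel"
    if os.path.exists(archive_path):
        return "ready"  # completed archives are reused
    if not try_acquire_lock(lock):
        return "in_progress"  # another worker is already building it
    backup_dir = os.path.normpath(backup_dir)
    parent = os.path.dirname(backup_dir)
    try:
        with tarfile.open(part, "w:gz") as tar:
            # Note: os.walk over files only, so empty directories are not archived.
            for root, _dirs, files in os.walk(backup_dir):
                for file_name in sorted(files):
                    if os.path.exists(cancel):
                        raise ArchiveCancelled()
                    full_path = os.path.join(root, file_name)
                    tar.add(full_path, arcname=os.path.relpath(full_path, parent))
        os.replace(part, archive_path)  # only a finished archive gets the final name
        return "ready"
    except ArchiveCancelled:
        return "cancelled"
    finally:
        for leftover in (part, cancel, lock):
            try:
                os.remove(leftover)
            except FileNotFoundError:
                pass
```

Because the final name only ever appears through the rename, a partial archive can no longer look complete, which was the original failure mode.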
Uploads stream in 8 MiB chunks to a hidden .uploading file and are renamed after the copy. The UI separates network transfer from server finalization, polls backup_upload_progress(), keeps the session alive, and stores the archive name in localStorage.
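A minimal sketch of the server-side copy, assuming any file-like object with read() (a Django UploadedFile would fit). The save_upload() name and the on_progress callback are hypothetical; the 8 MiB chunk size and the hidden .uploading name follow the description above.

```python
import os

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, as described in the article


def save_upload(stream, dest_dir: str, archive_name: str,
                on_progress=None, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy an upload to a hidden temp file, then rename it to its final name.

    Returns the number of bytes written. on_progress(written) is an optional,
    hypothetical hook where the caller could update the upload status JSON.
    """
    final_path = os.path.join(dest_dir, archive_name)
    temp_path = os.path.join(dest_dir, f".{archive_name}.uploading")
    written = 0
    try:
        with open(temp_path, "wb") as out:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
                written += len(chunk)
                if on_progress is not None:
                    on_progress(written)
        os.replace(temp_path, final_path)  # the archive appears only when complete
    except BaseException:
        try:
            os.remove(temp_path)
        except FileNotFoundError:
            pass
        raise
    return written
```

Reporting progress from inside the copy loop is what lets the UI distinguish "network transfer finished" from "server finished writing".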
safe_extract_backup_archive() validates every member before extraction. _validate_tar_members() rejects paths outside staging, links, and device entries. _detect_extracted_backup_root() accepts flat or singly nested archives, and is_valid_backup_dir() requires metadata.json plus metadata/ while allowing schema-only backups without shadow/.
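The member checks might look roughly like the following sketch, which uses only the standard tarfile module. It is not the project's _validate_tar_members(); the function names and the UnsafeArchiveError type are hypothetical, and the rules are the ones described above: no paths outside staging, no links, no device entries.

```python
import os
import tarfile


class UnsafeArchiveError(Exception):
    """Raised when an archive member must not be extracted."""


def validate_tar_members(tar: tarfile.TarFile, staging_dir: str) -> list:
    """Check every member before anything is written to disk."""
    root = os.path.realpath(staging_dir)
    members = tar.getmembers()
    for member in members:
        if member.issym() or member.islnk():
            raise UnsafeArchiveError(f"link not allowed: {member.name}")
        if not (member.isfile() or member.isdir()):
            raise UnsafeArchiveError(f"special file not allowed: {member.name}")
        # Absolute names and '..' segments both resolve outside the staging root.
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) != root:
            raise UnsafeArchiveError(f"path escapes staging: {member.name}")
    return members


def is_valid_backup_dir(path: str) -> bool:
    """Require metadata.json and metadata/; shadow/ is optional (schema-only)."""
    return (os.path.isfile(os.path.join(path, "metadata.json"))
            and os.path.isdir(os.path.join(path, "metadata")))
```

Validating the whole member list before extracting anything means a malicious entry at the end of the archive cannot leave earlier files behind in staging.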
Architecture
The system has four cooperating parts:
- Nginx accepts uploads up to 20 GiB, disables proxy buffering, and uses one-hour request/read timeouts.
- Django provides authentication, forms, endpoints, background workers, progress files, archive validation, and the final database cutover.
- clickhouse-backup runs as a separate container with its HTTP API on port 7171 and performs backup creation, deletion, and staging restores.
- ClickHouse and shared local disk hold production data, backup directories, archives, staging databases, and UI state.
Both Django and clickhouse-backup mount /var/lib/clickhouse. Backups live under backup/, archives under archive/, and restore state under .backup_ui_state/.
The most important architectural decision was not restoring directly into production. _load_backup_tables_from_metadata() discovers the tables, _build_restore_db_mapping() creates names such as restore_stg_<id>_<database>, and perform_restore(..., rm=False, restore_database_mapping=...) restores into those staging databases.
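As an illustration of the naming scheme, the mapping step could be sketched like this. The function names are hypothetical, and the "source:target" comma-separated format is an assumption modelled on the clickhouse-backup CLI's database-mapping flag, so it should be checked against the pinned version before reuse.

```python
def build_restore_db_mapping(restore_id: str, databases) -> dict:
    """Map each source database to a per-restore staging database name."""
    return {
        database: f"restore_stg_{restore_id}_{database}"
        for database in sorted(set(databases))
    }


def format_mapping_param(mapping: dict) -> str:
    """Render the mapping as 'src:dst,src:dst' (assumed parameter format)."""
    return ",".join(f"{source}:{target}" for source, target in mapping.items())
```

For example, a restore with the ID ab12 that touches default and logs would restore into restore_stg_ab12_default and restore_stg_ab12_logs, leaving the production databases untouched until cutover. Including the sanitized restore ID in the name keeps two restores from colliding in staging.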
Only after clickhouse-backup reports that the staging restore succeeded does _perform_atomic_cutover() touch production. Existing tables are swapped with EXCHANGE TABLES; missing tables are moved with RENAME TABLE; missing databases are created with the Atomic engine, which EXCHANGE TABLES requires. If a later table fails, _rollback_cutover() reverses completed operations before _cleanup_staging_databases() removes staging databases.
This is atomic per table rather than one transaction across the entire backup, but it dramatically reduces the interval in which production is exposed.
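The per-table cutover with reverse rollback can be sketched independently of any ClickHouse client. Here execute is an assumed callable that runs one SQL statement, and perform_cutover() is a hypothetical stand-in for _perform_atomic_cutover() and _rollback_cutover(), not their actual code; database creation is left out.

```python
def perform_cutover(execute, tables, staging_of, existing) -> None:
    """Swap staging tables into production, undoing finished steps on failure.

    tables:     list of (database, table) pairs from the backup metadata
    staging_of: dict mapping a production database to its staging database
    existing:   set of (database, table) pairs already present in production
    """
    undo = []  # statements that reverse each completed step, in order
    try:
        for database, table in tables:
            production = f"`{database}`.`{table}`"
            staging = f"`{staging_of[database]}`.`{table}`"
            if (database, table) in existing:
                statement = f"EXCHANGE TABLES {production} AND {staging}"
                execute(statement)
                undo.append(statement)  # EXCHANGE is its own inverse
            else:
                execute(f"RENAME TABLE {staging} TO {production}")
                undo.append(f"RENAME TABLE {production} TO {staging}")
    except Exception:
        for statement in reversed(undo):
            try:
                execute(statement)
            except Exception:
                pass  # best effort: keep rolling back the remaining tables
        raise
```

Recording the undo statement only after a step succeeds means the failing table itself is never "rolled back", which matches the per-table atomicity described above.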
flowchart LR
user[User / Browser]
nginx[Nginx container\nTLS reverse proxy\n20G body limit\n3600s timeouts\nproxy buffering off]
subgraph web["web container: Django app"]
urls["backup.urls\n/backups/* routes"]
views["backup.views\nrequest handlers"]
forms["BackupForm\nloads table choices"]
utils["backup.utils\nHTTP client for clickhouse-backup"]
archiveThread["Archive creation thread\ncreates .tar.gz for download"]
restoreThread["Atomic restore thread\nsingle-restore operating rule"]
uploadStatus["Upload status JSON\narchive/.upload_status/{upload_id}.json"]
restoreStatus["Restore status JSON\n.backup_ui_state/restore_status/{restore_id}.json"]
cancelFlag["Restore cancel flag\n.backup_ui_state/restore_cancel/{restore_id}.cancel"]
end
subgraph chBackup["clickhouse-backup container"]
chbApi["HTTP API server\n0.0.0.0:7171"]
chbActions["Actions history\n/backup/actions?last=20"]
end
subgraph clickhouse["clickhouse container"]
chServer["ClickHouse server\nTCP 9000"]
prodDB["Production databases/tables"]
stageDB["Restore staging databases\nrestore_stg_{restoreid}_{source_db}"]
end
subgraph disk["Shared local disk mounted as /var/lib/clickhouse"]
backupRoot["backup/\nclickhouse-backup local backup dirs\n{backup}/metadata.json\n{backup}/metadata/"]
archiveRoot["archive/\nUI download archives\nuploaded .tar.gz files\n.part/.lock/.progress/.cancel"]
uiState[".backup_ui_state/\nrestore status, cancel flags,\nrestore extraction staging"]
chData["ClickHouse table data"]
end
user -->|HTTPS UI/API requests| nginx
nginx -->|proxy /backups/*| urls
urls --> views
views --> forms
forms -->|GET /backup/tables| utils
views --> utils
utils -->|GET /backup/list| chbApi
utils -->|GET /backup/actions?last=20| chbApi
utils -->|GET /backup/tables| chbApi
utils -->|POST /backup/create| chbApi
utils -->|POST /backup/restore/{name}| chbApi
utils -->|POST /backup/delete/local/{name}| chbApi
chbApi --> chbActions
chbApi -->|create/restore/delete| chServer
chbApi -->|read/write local backups| backupRoot
views -->|read metadata.json| backupRoot
views -->|write uploaded archives| archiveRoot
views -->|write upload status| uploadStatus
views -->|write restore status| restoreStatus
views -->|write cancel flag| cancelFlag
views -->|spawn| archiveThread
archiveThread -->|tar backup dir| backupRoot
archiveThread -->|write .part then .tar.gz| archiveRoot
views -->|spawn| restoreThread
restoreThread -->|extract uploaded archive into staging| uiState
restoreThread -->|replace same-named backup dir| backupRoot
restoreThread -->|POST restore with database mapping| chbApi
restoreThread -->|poll actions every 2s, max 2h| chbActions
restoreThread -->|direct ClickHouse client| chServer
restoreThread -->|EXCHANGE or RENAME tables| prodDB
restoreThread -->|drop staging DBs| stageDB
chServer --> prodDB
chServer --> stageDB
chServer --> chData
backupRoot --- disk
archiveRoot --- disk
uiState --- disk
chData --- disk
Backup Creation and Listing
sequenceDiagram
autonumber
actor User
participant Browser
participant Nginx
participant Django as Django web / backup.views
participant Form as BackupForm
participant Utils as backup.utils
participant CHB as clickhouse-backup API
participant CH as ClickHouse
participant Disk as /var/lib/clickhouse/backup
User->>Browser: Open /backups/
Browser->>Nginx: GET /backups/
Nginx->>Django: proxy request
Django->>Utils: fetch_backups()
Utils->>CHB: GET /backup/list
CHB->>Disk: list local backup directories
CHB-->>Utils: newline-delimited JSON
Utils-->>Django: backups excluding internal artifacts
Django-->>Browser: items_list.html
User->>Browser: Click "Create a Backup"
Browser->>Django: GET /backups/create/
Django->>Form: instantiate BackupForm
Form->>Utils: fetch_tables()
Utils->>CHB: GET /backup/tables
CHB->>CH: inspect tables
CHB-->>Utils: table list
Utils-->>Form: only Database == default
Django-->>Browser: create_form.html
User->>Browser: Submit create form
Browser->>Django: POST /backups/create/
Django->>Utils: perform_create(form data)
alt All tables selected
Utils->>CHB: POST /backup/create?name={name}
else Selected tables
Utils->>CHB: POST /backup/create?name={name}&table=default.t1,default.t2
else Schema-only or no table options
Utils->>CHB: POST /backup/create?name={name}&schema=true
end
CHB->>CH: create backup from ClickHouse
CHB->>Disk: write backup metadata and data
CHB-->>Utils: acknowledged or error
Utils-->>Django: result
Django-->>Browser: JSON acknowledged / failed
loop Every 3 seconds until matched action succeeds or errors
Browser->>Django: GET /backups/actions/
Django->>Utils: fetch_actions()
Utils->>CHB: GET /backup/actions?last=20
CHB-->>Browser: actions via Django
end
Browser->>Django: GET /backups/details/{name}
Django->>Disk: read {name}/metadata.json
Django-->>Browser: backup details for new row
Download Archive Flow
sequenceDiagram
autonumber
actor User
participant Browser
participant Django as Django web / backup_download
participant Worker as Archive thread
participant BackupDir as /var/lib/clickhouse/backup/{name}
participant ArchiveDir as /var/lib/clickhouse/archive
User->>Browser: Click Download
Browser->>Django: POST /backups/download/{name}/
Django->>BackupDir: verify backup directory exists
Django->>ArchiveDir: check {name}.tar.gz, .lock, .part, .progress, .cancel
alt Archive already ready
Django-->>Browser: {"status":"ready","progress":1.0}
Browser->>Django: GET /backups/download/{name}/
Django-->>Browser: FileResponse {name}.tar.gz
else Archive missing and lock acquired
Django->>Worker: spawn daemon archive thread
Worker->>ArchiveDir: create exclusive .lock
Worker->>BackupDir: inventory files and total bytes
Worker->>ArchiveDir: write .progress JSON
loop tar each file
Worker->>ArchiveDir: write {name}.tar.gz.part
Worker->>ArchiveDir: update .progress throttled by time/bytes
Worker->>ArchiveDir: check .cancel
end
Worker->>ArchiveDir: rename .part to .tar.gz
Worker->>ArchiveDir: remove .lock/.progress/.cancel
Django-->>Browser: {"status":"in_progress","progress":0}
else Another archive worker is active
Django-->>Browser: current status from lock/progress files
end
loop Every 2 seconds while in_progress
Browser->>Django: GET /backups/download/{name}/?status=1
Django->>ArchiveDir: read .progress / inspect .tar.gz
Django-->>Browser: ready / in_progress / cancelled / error / missing
end
opt User cancels archive creation
Browser->>Django: POST /backups/download/{name}/ action=cancel
Django->>ArchiveDir: touch .cancel if running
Worker->>ArchiveDir: notice .cancel, remove .part, write cancelled status
Django-->>Browser: {"status":"cancelled"}
end
opt Stale artifacts
Django->>ArchiveDir: remove .lock/.part/.progress/.cancel older than 10 hours
end
Restore From Existing Backup
sequenceDiagram
autonumber
actor User
participant Browser
participant Django as Django web / backup_restore
participant Status as restore_status JSON
participant Worker as Atomic restore thread
participant Utils as backup.utils
participant CHB as clickhouse-backup API
participant CH as ClickHouse
participant BackupDir as /var/lib/clickhouse/backup/{name}
participant Cancel as restore_cancel flag
User->>Browser: Click Restore on existing backup
Browser->>Django: GET /backups/details/{name}
Django->>BackupDir: read metadata.json
Django-->>Browser: metadata details
User->>Browser: Submit password
Browser->>Django: POST /backups/restore/{name}/ with X-Restore-ID
Django->>Django: validate password
Django->>BackupDir: require metadata.json and metadata/
Django->>Status: write queued status
Django->>Worker: spawn daemon restore thread
Django-->>Browser: {"status":"accepted","restore_id":...}
Worker->>Status: preparing_backup_dir 10%
Worker->>BackupDir: parse metadata/{db}/*.json into db_to_tables
Worker->>Worker: build staging DB mapping restore_stg_{id}_{db}
Worker->>Status: staging_restore_start 70%
Worker->>Utils: perform_restore(name, rm=false, restore_database_mapping=...)
Utils->>CHB: POST /backup/restore/{name}?restore_database_mapping=...&rm omitted
CHB->>CH: restore backup into staging databases
CHB-->>Utils: acknowledged with optional operation_id
loop Every 2 seconds, max 2 hours
Worker->>Cancel: check cancel flag
Worker->>Utils: fetch_actions()
Utils->>CHB: GET /backup/actions?last=20
CHB-->>Worker: recent actions
Worker->>Status: staging_restore_in_progress 85-99%
end
alt Staging restore action success and no cancel requested
Worker->>Status: cutover_start 90%
loop For each table in backup metadata
Worker->>CH: CREATE DATABASE IF missing
alt Production table exists
Worker->>CH: EXCHANGE TABLES production AND staging
else Production table missing
Worker->>CH: RENAME TABLE staging TO production
end
Worker->>Status: cutover progress 90-99%
end
Worker->>CH: DROP DATABASE IF EXISTS each staging DB
Worker->>Status: completed 100%
else Staging restore action error
Worker->>CH: DROP DATABASE IF EXISTS each staging DB
Worker->>Status: error, previous production tables preserved
else Poll timeout
Worker->>CH: DROP DATABASE IF EXISTS each staging DB
Worker->>Status: error, previous production tables preserved
else Cancel requested before cutover
Worker->>CH: DROP DATABASE IF EXISTS each staging DB
Worker->>Status: canceled, previous production tables preserved
else Cutover failure
Worker->>CH: rollback completed EXCHANGE/RENAME operations
Worker->>CH: DROP DATABASE IF EXISTS each staging DB
Worker->>Status: error, previous production tables preserved as far as rollback succeeds
end
loop Browser polls every 3 seconds
Browser->>Django: GET /backups/restore/progress/{restore_id}/
Django->>Status: read JSON
Django-->>Browser: queued / processing / completed / canceled / error
end
opt User cancels restore
Browser->>Django: POST /backups/restore/cancel/{restore_id}/
Django->>Cancel: touch {restore_id}.cancel
Django->>Status: cancel_requested=true
Django-->>Browser: {"status":"cancel_requested"}
end
Restore From Uploaded Archive
sequenceDiagram
autonumber
actor User
participant Browser
participant Django as Django web
participant UploadStatus as upload_status JSON
participant ArchiveDir as /var/lib/clickhouse/archive
participant Worker as Atomic restore thread
participant Stage as .backup_ui_state/restore_staging
participant BackupRoot as /var/lib/clickhouse/backup
participant CHB as clickhouse-backup API
participant CH as ClickHouse
participant RestoreStatus as restore_status JSON
User->>Browser: Select .tar.gz and password
Browser->>Django: POST /backups/upload/file/ with X-Upload-ID
Django->>Django: sanitize filename; require .tar.gz
Django->>UploadStatus: write processing/server_copy
loop Stream upload in 8MiB chunks
Django->>ArchiveDir: write .{archive}.uploading
Django->>UploadStatus: update written/total
Browser->>Django: GET /backups/upload/progress/{upload_id}/
Django-->>Browser: server-side copy progress
end
Django->>ArchiveDir: rename temp upload to {archive}.tar.gz
Django->>UploadStatus: completed
Django-->>Browser: {"status":"uploaded","archive":...,"upload_id":...}
User->>Browser: Click Restore
Browser->>Django: POST /backups/extract/ with archive and X-Restore-ID
Django->>Django: validate password and archive exists
Django->>RestoreStatus: write queued
Django->>Worker: spawn daemon restore thread
Django-->>Browser: {"status":"accepted","restore_id":...}
Worker->>Stage: create unique import staging dir
Worker->>RestoreStatus: preparing_staging 3%
Worker->>ArchiveDir: open uploaded .tar.gz
loop Validate tar members
Worker->>Worker: reject path traversal, symlinks, hardlinks, devices
Worker->>RestoreStatus: validating_archive 5-25%
end
loop Extract tar members
Worker->>Stage: extract files
Worker->>RestoreStatus: extracting_archive 25-60%
end
Worker->>Worker: detect backup root; require metadata.json and metadata/
Worker->>RestoreStatus: preparing_backup_dir 62%
alt Same-named backup directory already exists
Worker->>BackupRoot: delete existing {backup_name} directory
end
Worker->>BackupRoot: move extracted backup as {backup_name}
Worker->>Stage: remove extraction staging directory
Worker->>CHB: POST /backup/restore/{backup_name}?restore_database_mapping=...&rm omitted
CHB->>CH: restore into staging databases
Worker->>CH: atomic cutover with EXCHANGE / RENAME
Worker->>CH: drop staging databases
Worker->>RestoreStatus: completed / canceled / error
loop Browser polls every 1.5 seconds
Browser->>Django: GET /backups/restore/progress/{restore_id}/
Django->>RestoreStatus: read JSON
Django-->>Browser: progress payload
end
Restore Status State Machine
stateDiagram-v2
[*] --> queued: POST restore/extract accepted
queued --> preparing_backup_dir: existing backup restore
queued --> preparing_staging: uploaded archive restore
preparing_staging --> validating_archive
validating_archive --> extracting_archive
extracting_archive --> preparing_backup_dir
preparing_backup_dir --> staging_restore_start
staging_restore_start --> staging_restore_in_progress: clickhouse-backup acknowledged
staging_restore_in_progress --> cutover_start: action status success, no cancel
staging_restore_in_progress --> canceling: cancel requested
staging_restore_in_progress --> failed: action status error or 2h timeout
canceling --> canceled: staging DBs dropped
cutover_start --> cutover
cutover --> completed: all tables exchanged/renamed, staging DBs dropped
cutover --> rollback: cutover exception
rollback --> failed: rollback attempted, staging DBs dropped
preparing_staging --> canceled: cancel flag before extraction
validating_archive --> canceled: cancel flag during validation
extracting_archive --> canceled: cancel flag during extraction
preparing_backup_dir --> failed: invalid/missing metadata
validating_archive --> failed: unsafe archive member
extracting_archive --> failed: extraction error
staging_restore_start --> failed: restore request not acknowledged
failed --> [*]
canceled --> [*]
completed --> [*]
Flow
A normal backup starts in BackupForm, which loads tables from fetch_tables() and currently limits choices to the default database. backup_create() validates the form and calls perform_create(). The browser then polls backup_actions() every three seconds and fetches backup_details() after the matching action succeeds.
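The three create modes reduce to a small parameter-building step. The sketch below is an assumption about how perform_create() might decide, based on the sequence diagram: the function names and the precedence (schema-only wins over a table selection) are hypothetical.

```python
def build_create_params(name: str, all_tables: bool, tables, schema_only: bool) -> dict:
    """Translate form choices into query parameters for POST /backup/create."""
    params = {"name": name}
    if schema_only or (not all_tables and not tables):
        # "No tables selected" is made explicit: it means a schema-only backup.
        params["schema"] = "true"
    elif not all_tables:
        params["table"] = ",".join(tables)
    return params


def is_success(status_code: int) -> bool:
    """Accept any 2xx response instead of one exact status code."""
    return 200 <= status_code < 300
```

Making the empty-selection case an explicit branch documents the historical schema-only behavior instead of leaving it as a side effect.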
A download starts with a POST to backup_download(). Django either returns an existing archive, reports an active worker, or starts a daemon thread. The browser polls every two seconds until the archive is ready, canceled, or failed, then performs the final GET for the FileResponse.
Restore from an existing backup and restore from an uploaded file converge on the same worker:
- backup_restore() or backup_extract() validates the password and queues _run_atomic_restore_job().
- _queue_restore_job() writes the initial status JSON and starts a daemon thread.
- Uploaded archives first pass through _prepare_backup_from_uploaded_archive(), which validates and extracts into a unique staging directory. By design, an uploaded archive replaces a same-named local backup directory and becomes the source of truth.
- The worker restores into mapped staging databases and polls fetch_actions() every two seconds, with a two-hour deadline.
- A cancel request creates a file through backup_restore_cancel(). Cancellation is cooperative: while clickhouse-backup is running, the worker waits for a safe stopping point, then drops staging data without changing the previous production tables.
- On success, _perform_atomic_cutover() exchanges or renames tables, reports table-level progress, cleans staging databases, and marks the job complete.
- backup_restore_progress() exposes the JSON state to either the backup-list page or upload page.
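The polling and cooperative-cancel behavior in the steps above could be sketched as a single loop. The wait_for_action() name, the matches callback, and the "success" / "error" status strings are assumptions; the injectable sleep and clock exist only so the loop can be tested without waiting.

```python
import time


def wait_for_action(fetch_actions, matches, cancel_requested,
                    interval: float = 2.0, deadline_seconds: float = 7200.0,
                    sleep=time.sleep, clock=time.monotonic) -> str:
    """Poll until the matching action finishes, the deadline passes, or both.

    Returns 'success', 'error', 'canceled', or 'timeout'. A cancel request does
    not interrupt the external restore; it is honored once the action finishes,
    before any production table is touched.
    """
    deadline = clock() + deadline_seconds
    while clock() < deadline:
        for action in fetch_actions():
            if not matches(action):
                continue
            status = action.get("status")
            if status == "error":
                return "error"
            if status == "success":
                return "canceled" if cancel_requested() else "success"
        sleep(interval)
    return "timeout"
```

In every outcome other than "success", the caller would drop the staging databases and leave production as it was.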
The visible percentages are phase-based rather than byte-perfect: archive validation occupies 5-25%, extraction 25-60%, staging restore starts at 70%, waiting advances from 85-99%, and cutover uses 90-99%. That compromise gives users useful feedback even though the external API does not provide detailed restore progress.
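The phase-based mapping is simple linear interpolation inside each phase's range. In this sketch the ranges come from the paragraph above, while the PHASE_RANGES table and the phase_percent() helper are hypothetical names.

```python
# Percent ranges per phase, as listed in the article.
PHASE_RANGES = {
    "validating_archive": (5, 25),
    "extracting_archive": (25, 60),
    "staging_restore_in_progress": (85, 99),
    "cutover": (90, 99),
}


def phase_percent(phase: str, fraction: float) -> int:
    """Map progress within a phase (0.0 to 1.0) onto that phase's percent range."""
    low, high = PHASE_RANGES[phase]
    fraction = min(max(fraction, 0.0), 1.0)  # clamp out-of-range input
    return int(low + (high - low) * fraction)
```

Halfway through validation, for instance, phase_percent("validating_archive", 0.5) yields 15.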
Takeaways
The main lesson was that a backup button is easy; trustworthy lifecycle management is the feature. The progression from tar.extractall() and direct restore to validated extraction, staging databases, atomic table swaps, and reverse rollback was the most important improvement.
I also learned to make compromises explicit. The current implementation uses in-process daemon threads and file-backed JSON instead of Celery or another durable queue. If Django restarts, an active worker disappears and its status may remain stale until cleanup. Download generation has a file lock, but restore still relies on an operating rule of one restore at a time; there is no global cross-process lock. Storage is local disk only, uploaded archives are not checksummed, and there is no ClickHouse backup scheduler or retention policy.
The next improvements are clear:
- Move workers and progress into a durable task queue.
- Add a distributed restore lock and startup reconciliation for interrupted jobs.
- Add object storage, checksums, retention, and scheduled backups.
- Add integration tests for cancellation, cutover failure, rollback, and real clickhouse-backup responses.
- Split the large backup.views module into archive, upload, restore, and status services.
- Improve action correlation beyond matching backup names and timestamps.
- Remove unused restore constants and finish aligning every bilingual phase label with the backend state machine.
For a portfolio project, I think the honest limitations are part of the result. The implementation is not pretending to be a distributed backup platform. It is a practical, safer orchestration layer for one appliance-style deployment, with a clear path to become more durable.
Footnotes
- The deployment pins its mirrored clickhouse-backup image to 2.6.25 and ClickHouse to 23.9.
- Local backups are stored under /var/lib/clickhouse/backup; archives under /var/lib/clickhouse/archive.
- Upload and restore status files are retained for 12 hours; download work artifacts become stale after 10 hours.
- API requests use a 5-second connection timeout and a 30-second response timeout. Restore action polling has a separate two-hour ceiling.
- Existing-backup restore requires the user’s password, as do restore-from-file and deletion.
- Unit tests currently cover internal-artifact filtering, create/restore parameter construction, identifier sanitization, path-traversal rejection, and nested archive layouts. The atomic cutover path still needs broader integration coverage.
- The full component, sequence, and state diagrams are documented in docs/clickhouse-backup-restore-architecture.md.