
ClickHouse Backup, the hard way




Situation

I was working on a Docker-based log analysis platform where ClickHouse stores the production logs. The system already included Django, Nginx, ClickHouse, and clickhouse-backup, but backup operations were still too close to infrastructure commands: users needed a practical way to create, inspect, download, upload, restore, and delete backups from the web interface.

What looked like a small CRUD feature became an operations problem. Backups can be large, uploads pass through several layers, and restore is destructive. A frozen request or a vague “restore failed” message was not acceptable for production log tables.

The final feature creates all-table, selected-table, or schema-only backups; lists metadata and actions; builds downloadable archives; uploads and restores .tar.gz files; reports progress; supports cancellation and deletion; and presents the workflow in English and Persian.

Problem

The first implementation proved the basic idea, but also exposed most of the dangerous edge cases.

Archive downloads were originally created synchronously with a tar subprocess. Large backups held HTTP requests open, duplicate clicks could duplicate work, and partial archives could look complete. Upload progress could also reach 100% while Django was still copying the file.

The first restore-from-file prototype trusted direct tar extraction, introducing path-traversal and link risks. It required a shadow/ directory, which broke schema-only backups, and early versions extracted files without completing the ClickHouse restore.

The bigger issue was restore semantics. Calling perform_restore() directly against production could replace schema objects before the new data was known to be usable. Stopping at the wrong moment could leave production half-restored.

There were smaller integration bugs too:

  • Internal directories such as .restore_status, .restore_cancel, and .import-* could leak into the backup list.
  • The create API did not always return the exact status code or JSON shape the UI expected.
  • “No tables selected” had historical schema-only behavior that needed to be made explicit.
  • Missing runtime directories caused first-run upload or restore failures.
  • Long uploads needed matching Nginx body-size, timeout, and buffering settings.
  • Session expiry could turn an AJAX restore request into an HTML login redirect.
  • Progress files, locks, temporary archives, and cancel flags could become stale after interruption.

Solution

I kept clickhouse-backup as the backup engine and built an orchestration layer around it.

The API wrapper in backup.utils handles the infrastructure calls:

  • fetch_backups(), fetch_actions(), and fetch_tables() parse the API’s newline-delimited JSON.
  • fetch_details() reads and formats metadata.json.
  • perform_create() translates the form into name, table, and schema parameters while accepting any successful 2xx response.
  • perform_restore() supports database/table mappings, selected tables, schema/data modes, and an optional rm flag.
  • perform_delete() removes the local backup through the API.
  • is_internal_backup_name() hides UI runtime artifacts from users.
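
As a rough illustration of the parsing and filtering described in the list above, here is a minimal sketch. It is not the project's code: the helper names parse_ndjson() and visible_backups(), the prefix list, and the assumption that each row carries a "name" field are mine.

```python
import json

# Assumed prefixes for UI runtime artifacts; the article names these three.
INTERNAL_PREFIXES = (".restore_status", ".restore_cancel", ".import-")


def is_internal_backup_name(name: str) -> bool:
    """Return True for directory names created by the UI itself."""
    return name.startswith(INTERNAL_PREFIXES)


def parse_ndjson(body: str) -> list:
    """Parse a response body that holds one JSON object per non-empty line."""
    rows = []
    for line in body.splitlines():
        line = line.strip()
        if line:
            rows.append(json.loads(line))
    return rows


def visible_backups(body: str) -> list:
    """Drop rows whose name marks them as internal artifacts."""
    return [
        row for row in parse_ndjson(body)
        if not is_internal_backup_name(str(row.get("name", "")))
    ]
```

Parsing line by line keeps a single malformed row easy to locate, since json.loads() raises on exactly the line that is broken.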

The Django views add orchestration. Operation IDs make logs traceable. sanitize_archive_name(), sanitize_upload_id(), and sanitize_restore_id() constrain user-controlled names. _ensure_backup_runtime_dirs() handles clean installations, while _write_json_atomic() prevents polling clients from reading half-written JSON.
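
The atomic-write idea can be sketched with a temporary file and os.replace(). This is an assumption about how a helper like _write_json_atomic() might work, not a copy of it; the function name and the temporary-file prefix are hypothetical.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, payload: dict) -> None:
    """Write JSON so that readers see either the old file or the new one."""
    directory = os.path.dirname(path) or "."
    os.makedirs(directory, exist_ok=True)
    # The temp file lives in the same directory so the final rename stays
    # on one filesystem, which is what makes os.replace() atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Never leave a half-written temp file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

A polling client that opens the status file mid-update then reads the previous complete document rather than a truncated one.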

For downloads, backup_download() became a state machine. It streams files into .tar.gz.part, publishes throttled progress, and atomically renames the result. An exclusive .lock prevents duplicate workers; .cancel supports cancellation; .progress exposes status; and work artifacts older than ten hours are treated as stale and removed. Completed archives are reused.
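
The lock and publish steps could look roughly like the sketch below. The function names are hypothetical, and stale-lock removal here is deliberately simple: two processes racing to clear the same stale lock is a case a production version would need to handle more carefully.

```python
import os
import time
from typing import Optional

STALE_AFTER_SECONDS = 10 * 60 * 60  # ten hours, as described above


def is_stale(path: str, now: Optional[float] = None) -> bool:
    """A work artifact is stale once its mtime is older than the threshold."""
    now = time.time() if now is None else now
    try:
        return now - os.path.getmtime(path) > STALE_AFTER_SECONDS
    except FileNotFoundError:
        return False


def try_acquire_lock(lock_path: str) -> bool:
    """Create the lock file exclusively; return False if a worker holds it."""
    if is_stale(lock_path):
        try:
            os.unlink(lock_path)
        except FileNotFoundError:
            pass
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def publish_archive(part_path: str, final_path: str) -> None:
    """Expose the archive only once it is complete."""
    os.replace(part_path, final_path)
```

Because the finished archive only ever appears through the rename, a partial .part file can never be mistaken for a complete download.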

Uploads stream in 8 MiB chunks to a hidden .uploading file and are renamed after the copy. The UI separates network transfer from server finalization, polls backup_upload_progress(), keeps the session alive, and stores the archive name in localStorage.
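
A minimal sketch of the server-side copy, assuming the upload is available as a readable binary stream. save_upload() and its on_progress callback are hypothetical names; in Django the source would typically be the uploaded file object.

```python
import os
from typing import BinaryIO, Callable

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB, as described above


def save_upload(
    source: BinaryIO,
    archive_dir: str,
    archive_name: str,
    on_progress: Callable[[int], None] = lambda written: None,
) -> str:
    """Copy the stream to a hidden temp file, then rename it into place."""
    os.makedirs(archive_dir, exist_ok=True)
    temp_path = os.path.join(archive_dir, f".{archive_name}.uploading")
    final_path = os.path.join(archive_dir, archive_name)
    written = 0
    try:
        with open(temp_path, "wb") as target:
            while True:
                chunk = source.read(CHUNK_SIZE)
                if not chunk:
                    break
                target.write(chunk)
                written += len(chunk)
                on_progress(written)  # e.g. update the upload status JSON
        os.replace(temp_path, final_path)
    except BaseException:
        # A failed copy must not leave a hidden partial file behind.
        if os.path.exists(temp_path):
            os.unlink(temp_path)
        raise
    return final_path
```

Reporting bytes written on the server, separately from the browser's network progress, is what lets the UI avoid showing 100% while the copy is still running.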

safe_extract_backup_archive() validates every member before extraction. _validate_tar_members() rejects paths outside staging, links, and device entries. _detect_extracted_backup_root() accepts flat or singly nested archives, and is_valid_backup_dir() requires metadata.json plus metadata/ while allowing schema-only backups without shadow/.
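
The member checks can be sketched with the standard tarfile module. This is an illustration of the rules listed above rather than the project's _validate_tar_members(); the exception class and function name are hypothetical.

```python
import os
import tarfile


class UnsafeArchiveError(Exception):
    """Raised when an archive member must not be extracted."""


def validate_tar_members(archive: tarfile.TarFile, staging_dir: str) -> list:
    """Check every member before anything is written to disk."""
    root = os.path.realpath(staging_dir)
    members = archive.getmembers()
    for member in members:
        if member.issym() or member.islnk():
            raise UnsafeArchiveError(f"link entry rejected: {member.name}")
        if member.isdev():
            raise UnsafeArchiveError(f"device entry rejected: {member.name}")
        if not (member.isfile() or member.isdir()):
            raise UnsafeArchiveError(f"unsupported entry type: {member.name}")
        # Resolve the would-be destination and require it to stay in staging.
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) != root:
            raise UnsafeArchiveError(f"path escapes staging: {member.name}")
    return members
```

Validating the whole member list first means a hostile entry near the end of the archive is caught before any earlier file has been extracted.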

Architecture

The system has four cooperating parts:

  1. Nginx accepts uploads up to 20 GiB, disables proxy buffering, and uses one-hour request/read timeouts.
  2. Django provides authentication, forms, endpoints, background workers, progress files, archive validation, and the final database cutover.
  3. clickhouse-backup runs as a separate container with its HTTP API on port 7171 and performs backup creation, deletion, and staging restores.
  4. ClickHouse and shared local disk hold production data, backup directories, archives, staging databases, and UI state.

Both Django and clickhouse-backup mount /var/lib/clickhouse. Backups live under backup/, archives under archive/, and restore state under .backup_ui_state/.

The most important architectural decision was not restoring directly into production. _load_backup_tables_from_metadata() discovers the tables, _build_restore_db_mapping() creates names such as restore_stg_<id>_<database>, and perform_restore(..., rm=False, restore_database_mapping=...) restores into those staging databases.
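
To make the naming scheme concrete, here is a small sketch of how a mapping could be built and serialized. Both function names are hypothetical, replacing unusual characters in the restore ID is my assumption, and the comma-separated source:target format is how I understand the clickhouse-backup restore_database_mapping parameter.

```python
import re


def build_restore_db_mapping(restore_id: str, databases: list) -> dict:
    """Map each source database to a per-restore staging database name."""
    # Assumption: keep only characters that are safe in a database name.
    safe_id = re.sub(r"[^0-9A-Za-z_]", "_", restore_id)
    return {db: f"restore_stg_{safe_id}_{db}" for db in sorted(set(databases))}


def mapping_query_value(mapping: dict) -> str:
    """Serialize the mapping as 'source:target' pairs joined by commas."""
    return ",".join(f"{source}:{target}" for source, target in mapping.items())
```

Including the restore ID in every staging name keeps leftovers from an interrupted job easy to identify and drop later.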

Only after clickhouse-backup reports that the staging restore succeeded does _perform_atomic_cutover() touch production. Tables that already exist in production are swapped with EXCHANGE TABLES; tables that are missing are moved in with RENAME TABLE; missing databases are created with the Atomic engine. If a later table fails, _rollback_cutover() reverses the completed operations before _cleanup_staging_databases() removes the staging databases.
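
The cutover-with-rollback pattern can be sketched as follows. This is a simplified illustration, not the project's implementation: perform_cutover(), ident(), and the run callable (anything that executes one SQL statement) are hypothetical, and the set of existing production tables is passed in instead of being queried.

```python
import re
from typing import Callable, Iterable, Set, Tuple


def ident(name: str) -> str:
    """Backtick-quote a simple identifier; anything unusual is refused."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return f"`{name}`"


def perform_cutover(
    tables: Iterable[Tuple[str, str, str]],  # (production db, staging db, table)
    existing: Set[Tuple[str, str]],          # (production db, table) already present
    run: Callable[[str], None],
) -> None:
    """Swap or move each table; undo completed steps in reverse on failure."""
    undo = []
    try:
        for prod_db, stage_db, table in tables:
            prod = f"{ident(prod_db)}.{ident(table)}"
            stage = f"{ident(stage_db)}.{ident(table)}"
            if (prod_db, table) in existing:
                swap = f"EXCHANGE TABLES {prod} AND {stage}"
                run(swap)
                undo.append(swap)  # exchanging again restores the original
            else:
                run(f"CREATE DATABASE IF NOT EXISTS {ident(prod_db)} ENGINE = Atomic")
                run(f"RENAME TABLE {stage} TO {prod}")
                undo.append(f"RENAME TABLE {prod} TO {stage}")
    except Exception:
        for statement in reversed(undo):
            try:
                run(statement)
            except Exception:
                # Best effort: a real implementation should log this failure.
                continue
        raise
```

An undo step is recorded only after its forward statement succeeds, so the rollback never tries to reverse an operation that did not happen.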

This is atomic per table rather than one transaction across the entire backup, but it dramatically reduces the interval in which production is exposed.

flowchart LR
    user[User / Browser]
    nginx[Nginx container\nTLS reverse proxy\n20G body limit\n3600s timeouts\nproxy buffering off]

    subgraph web["web container: Django app"]
        urls["backup.urls\n/backups/* routes"]
        views["backup.views\nrequest handlers"]
        forms["BackupForm\nloads table choices"]
        utils["backup.utils\nHTTP client for clickhouse-backup"]

        archiveThread["Archive creation thread\ncreates .tar.gz for download"]
        restoreThread["Atomic restore thread\nsingle-restore operating rule"]

        uploadStatus["Upload status JSON\narchive/.upload_status/{upload_id}.json"]
        restoreStatus["Restore status JSON\n.backup_ui_state/restore_status/{restore_id}.json"]
        cancelFlag["Restore cancel flag\n.backup_ui_state/restore_cancel/{restore_id}.cancel"]
    end

    subgraph chBackup["clickhouse-backup container"]
        chbApi["HTTP API server\n0.0.0.0:7171"]
        chbActions["Actions history\n/backup/actions?last=20"]
    end

    subgraph clickhouse["clickhouse container"]
        chServer["ClickHouse server\nTCP 9000"]
        prodDB["Production databases/tables"]
        stageDB["Restore staging databases\nrestore_stg_{restoreid}_{source_db}"]
    end

    subgraph disk["Shared local disk mounted as /var/lib/clickhouse"]
        backupRoot["backup/\nclickhouse-backup local backup dirs\n{backup}/metadata.json\n{backup}/metadata/"]
        archiveRoot["archive/\nUI download archives\nuploaded .tar.gz files\n.part/.lock/.progress/.cancel"]
        uiState[".backup_ui_state/\nrestore status, cancel flags,\nrestore extraction staging"]
        chData["ClickHouse table data"]
    end

    user -->|HTTPS UI/API requests| nginx
    nginx -->|proxy /backups/*| urls
    urls --> views
    views --> forms
    forms -->|GET /backup/tables| utils
    views --> utils

    utils -->|GET /backup/list| chbApi
    utils -->|GET /backup/actions?last=20| chbApi
    utils -->|GET /backup/tables| chbApi
    utils -->|POST /backup/create| chbApi
    utils -->|POST /backup/restore/{name}| chbApi
    utils -->|POST /backup/delete/local/{name}| chbApi

    chbApi --> chbActions
    chbApi -->|create/restore/delete| chServer
    chbApi -->|read/write local backups| backupRoot

    views -->|read metadata.json| backupRoot
    views -->|write uploaded archives| archiveRoot
    views -->|write upload status| uploadStatus
    views -->|write restore status| restoreStatus
    views -->|write cancel flag| cancelFlag

    views -->|spawn| archiveThread
    archiveThread -->|tar backup dir| backupRoot
    archiveThread -->|write .part then .tar.gz| archiveRoot

    views -->|spawn| restoreThread
    restoreThread -->|extract uploaded archive into staging| uiState
    restoreThread -->|replace same-named backup dir| backupRoot
    restoreThread -->|POST restore with database mapping| chbApi
    restoreThread -->|poll actions every 2s, max 2h| chbActions
    restoreThread -->|direct ClickHouse client| chServer
    restoreThread -->|EXCHANGE or RENAME tables| prodDB
    restoreThread -->|drop staging DBs| stageDB

    chServer --> prodDB
    chServer --> stageDB
    chServer --> chData
    backupRoot --- disk
    archiveRoot --- disk
    uiState --- disk
    chData --- disk

Backup Creation and Listing

sequenceDiagram
    autonumber
    actor User
    participant Browser
    participant Nginx
    participant Django as Django web / backup.views
    participant Form as BackupForm
    participant Utils as backup.utils
    participant CHB as clickhouse-backup API
    participant CH as ClickHouse
    participant Disk as /var/lib/clickhouse/backup

    User->>Browser: Open /backups/
    Browser->>Nginx: GET /backups/
    Nginx->>Django: proxy request
    Django->>Utils: fetch_backups()
    Utils->>CHB: GET /backup/list
    CHB->>Disk: list local backup directories
    CHB-->>Utils: newline-delimited JSON
    Utils-->>Django: backups excluding internal artifacts
    Django-->>Browser: items_list.html

    User->>Browser: Click "Create a Backup"
    Browser->>Django: GET /backups/create/
    Django->>Form: instantiate BackupForm
    Form->>Utils: fetch_tables()
    Utils->>CHB: GET /backup/tables
    CHB->>CH: inspect tables
    CHB-->>Utils: table list
    Utils-->>Form: only Database == default
    Django-->>Browser: create_form.html

    User->>Browser: Submit create form
    Browser->>Django: POST /backups/create/
    Django->>Utils: perform_create(form data)
    alt All tables selected
        Utils->>CHB: POST /backup/create?name={name}
    else Selected tables
        Utils->>CHB: POST /backup/create?name={name}&table=default.t1,default.t2
    else Schema-only or no table options
        Utils->>CHB: POST /backup/create?name={name}&schema=true
    end
    CHB->>CH: create backup from ClickHouse
    CHB->>Disk: write backup metadata and data
    CHB-->>Utils: acknowledged or error
    Utils-->>Django: result
    Django-->>Browser: JSON acknowledged / failed

    loop Every 3 seconds until matched action succeeds or errors
        Browser->>Django: GET /backups/actions/
        Django->>Utils: fetch_actions()
        Utils->>CHB: GET /backup/actions?last=20
        CHB-->>Utils: recent actions
        Utils-->>Django: parsed actions
        Django-->>Browser: actions JSON
    end

    Browser->>Django: GET /backups/details/{name}
    Django->>Disk: read {name}/metadata.json
    Django-->>Browser: backup details for new row

Download Archive Flow

sequenceDiagram
    autonumber
    actor User
    participant Browser
    participant Django as Django web / backup_download
    participant Worker as Archive thread
    participant BackupDir as /var/lib/clickhouse/backup/{name}
    participant ArchiveDir as /var/lib/clickhouse/archive

    User->>Browser: Click Download
    Browser->>Django: POST /backups/download/{name}/
    Django->>BackupDir: verify backup directory exists
    Django->>ArchiveDir: check {name}.tar.gz, .lock, .part, .progress, .cancel

    alt Archive already ready
        Django-->>Browser: {"status":"ready","progress":1.0}
        Browser->>Django: GET /backups/download/{name}/
        Django-->>Browser: FileResponse {name}.tar.gz
    else Archive missing and lock acquired
        Django->>Worker: spawn daemon archive thread
        Worker->>ArchiveDir: create exclusive .lock
        Worker->>BackupDir: inventory files and total bytes
        Worker->>ArchiveDir: write .progress JSON
        loop tar each file
            Worker->>ArchiveDir: write {name}.tar.gz.part
            Worker->>ArchiveDir: update .progress throttled by time/bytes
            Worker->>ArchiveDir: check .cancel
        end
        Worker->>ArchiveDir: rename .part to .tar.gz
        Worker->>ArchiveDir: remove .lock/.progress/.cancel
        Django-->>Browser: {"status":"in_progress","progress":0}
    else Another archive worker is active
        Django-->>Browser: current status from lock/progress files
    end

    loop Every 2 seconds while in_progress
        Browser->>Django: GET /backups/download/{name}/?status=1
        Django->>ArchiveDir: read .progress / inspect .tar.gz
        Django-->>Browser: ready / in_progress / cancelled / error / missing
    end

    opt User cancels archive creation
        Browser->>Django: POST /backups/download/{name}/ action=cancel
        Django->>ArchiveDir: touch .cancel if running
        Worker->>ArchiveDir: notice .cancel, remove .part, write cancelled status
        Django-->>Browser: {"status":"cancelled"}
    end

    opt Stale artifacts
        Django->>ArchiveDir: remove .lock/.part/.progress/.cancel older than 10 hours
    end

Restore From Existing Backup

sequenceDiagram
    autonumber
    actor User
    participant Browser
    participant Django as Django web / backup_restore
    participant Status as restore_status JSON
    participant Worker as Atomic restore thread
    participant Utils as backup.utils
    participant CHB as clickhouse-backup API
    participant CH as ClickHouse
    participant BackupDir as /var/lib/clickhouse/backup/{name}
    participant Cancel as restore_cancel flag

    User->>Browser: Click Restore on existing backup
    Browser->>Django: GET /backups/details/{name}
    Django->>BackupDir: read metadata.json
    Django-->>Browser: metadata details

    User->>Browser: Submit password
    Browser->>Django: POST /backups/restore/{name}/ with X-Restore-ID
    Django->>Django: validate password
    Django->>BackupDir: require metadata.json and metadata/
    Django->>Status: write queued status
    Django->>Worker: spawn daemon restore thread
    Django-->>Browser: {"status":"accepted","restore_id":...}

    Worker->>Status: preparing_backup_dir 10%
    Worker->>BackupDir: parse metadata/{db}/*.json into db_to_tables
    Worker->>Worker: build staging DB mapping restore_stg_{id}_{db}
    Worker->>Status: staging_restore_start 70%
    Worker->>Utils: perform_restore(name, rm=false, restore_database_mapping=...)
    Utils->>CHB: POST /backup/restore/{name}?restore_database_mapping=... (rm omitted)
    CHB->>CH: restore backup into staging databases
    CHB-->>Utils: acknowledged with optional operation_id

    loop Every 2 seconds, max 2 hours
        Worker->>Cancel: check cancel flag
        Worker->>Utils: fetch_actions()
        Utils->>CHB: GET /backup/actions?last=20
        CHB-->>Worker: recent actions
        Worker->>Status: staging_restore_in_progress 85-99%
    end

    alt Staging restore action success and no cancel requested
        Worker->>Status: cutover_start 90%
        loop For each table in backup metadata
            Worker->>CH: CREATE DATABASE IF missing
            alt Production table exists
                Worker->>CH: EXCHANGE TABLES production AND staging
            else Production table missing
                Worker->>CH: RENAME TABLE staging TO production
            end
            Worker->>Status: cutover progress 90-99%
        end
        Worker->>CH: DROP DATABASE IF EXISTS each staging DB
        Worker->>Status: completed 100%
    else Staging restore action error
        Worker->>CH: DROP DATABASE IF EXISTS each staging DB
        Worker->>Status: error, previous production tables preserved
    else Poll timeout
        Worker->>CH: DROP DATABASE IF EXISTS each staging DB
        Worker->>Status: error, previous production tables preserved
    else Cancel requested before cutover
        Worker->>CH: DROP DATABASE IF EXISTS each staging DB
        Worker->>Status: canceled, previous production tables preserved
    else Cutover failure
        Worker->>CH: rollback completed EXCHANGE/RENAME operations
        Worker->>CH: DROP DATABASE IF EXISTS each staging DB
        Worker->>Status: error, previous production tables preserved as far as rollback succeeds
    end

    loop Browser polls every 3 seconds
        Browser->>Django: GET /backups/restore/progress/{restore_id}/
        Django->>Status: read JSON
        Django-->>Browser: queued / processing / completed / canceled / error
    end

    opt User cancels restore
        Browser->>Django: POST /backups/restore/cancel/{restore_id}/
        Django->>Cancel: touch {restore_id}.cancel
        Django->>Status: cancel_requested=true
        Django-->>Browser: {"status":"cancel_requested"}
    end

Restore From Uploaded Archive

sequenceDiagram
    autonumber
    actor User
    participant Browser
    participant Django as Django web
    participant UploadStatus as upload_status JSON
    participant ArchiveDir as /var/lib/clickhouse/archive
    participant Worker as Atomic restore thread
    participant Stage as .backup_ui_state/restore_staging
    participant BackupRoot as /var/lib/clickhouse/backup
    participant CHB as clickhouse-backup API
    participant CH as ClickHouse
    participant RestoreStatus as restore_status JSON

    User->>Browser: Select .tar.gz and password
    Browser->>Django: POST /backups/upload/file/ with X-Upload-ID
    Django->>Django: sanitize filename; require .tar.gz
    Django->>UploadStatus: write processing/server_copy
    loop Stream upload in 8MiB chunks
        Django->>ArchiveDir: write .{archive}.uploading
        Django->>UploadStatus: update written/total
        Browser->>Django: GET /backups/upload/progress/{upload_id}/
        Django-->>Browser: server-side copy progress
    end
    Django->>ArchiveDir: rename temp upload to {archive}.tar.gz
    Django->>UploadStatus: completed
    Django-->>Browser: {"status":"uploaded","archive":...,"upload_id":...}

    User->>Browser: Click Restore
    Browser->>Django: POST /backups/extract/ with archive and X-Restore-ID
    Django->>Django: validate password and archive exists
    Django->>RestoreStatus: write queued
    Django->>Worker: spawn daemon restore thread
    Django-->>Browser: {"status":"accepted","restore_id":...}

    Worker->>Stage: create unique import staging dir
    Worker->>RestoreStatus: preparing_staging 3%
    Worker->>ArchiveDir: open uploaded .tar.gz
    loop Validate tar members
        Worker->>Worker: reject path traversal, symlinks, hardlinks, devices
        Worker->>RestoreStatus: validating_archive 5-25%
    end
    loop Extract tar members
        Worker->>Stage: extract files
        Worker->>RestoreStatus: extracting_archive 25-60%
    end
    Worker->>Worker: detect backup root; require metadata.json and metadata/
    Worker->>RestoreStatus: preparing_backup_dir 62%

    alt Same-named backup directory already exists
        Worker->>BackupRoot: delete existing {backup_name} directory
    end
    Worker->>BackupRoot: move extracted backup as {backup_name}
    Worker->>Stage: remove extraction staging directory

    Worker->>CHB: POST /backup/restore/{backup_name}?restore_database_mapping=... (rm omitted)
    CHB->>CH: restore into staging databases
    Worker->>CH: atomic cutover with EXCHANGE / RENAME
    Worker->>CH: drop staging databases
    Worker->>RestoreStatus: completed / canceled / error

    loop Browser polls every 1.5 seconds
        Browser->>Django: GET /backups/restore/progress/{restore_id}/
        Django->>RestoreStatus: read JSON
        Django-->>Browser: progress payload
    end

Restore Status State Machine

stateDiagram-v2
    [*] --> queued: POST restore/extract accepted
    queued --> preparing_backup_dir: existing backup restore
    queued --> preparing_staging: uploaded archive restore

    preparing_staging --> validating_archive
    validating_archive --> extracting_archive
    extracting_archive --> preparing_backup_dir

    preparing_backup_dir --> staging_restore_start
    staging_restore_start --> staging_restore_in_progress: clickhouse-backup acknowledged

    staging_restore_in_progress --> cutover_start: action status success, no cancel
    staging_restore_in_progress --> canceling: cancel requested
    staging_restore_in_progress --> failed: action status error or 2h timeout

    canceling --> canceled: staging DBs dropped

    cutover_start --> cutover
    cutover --> completed: all tables exchanged/renamed, staging DBs dropped
    cutover --> rollback: cutover exception
    rollback --> failed: rollback attempted, staging DBs dropped

    preparing_staging --> canceled: cancel flag before extraction
    validating_archive --> canceled: cancel flag during validation
    extracting_archive --> canceled: cancel flag during extraction
    preparing_backup_dir --> failed: invalid/missing metadata
    validating_archive --> failed: unsafe archive member
    extracting_archive --> failed: extraction error
    staging_restore_start --> failed: restore request not acknowledged

    failed --> [*]
    canceled --> [*]
    completed --> [*]

Flow

A normal backup starts in BackupForm, which loads tables from fetch_tables() and currently limits choices to the default database. backup_create() validates the form and calls perform_create(). The browser then polls backup_actions() every three seconds and fetches backup_details() after the matching action succeeds.
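
The three request shapes used by the create step can be sketched like this. The mode strings and function names are hypothetical stand-ins for whatever the form actually produces; the parameter names (name, table, schema) are the ones mentioned in this article.

```python
from urllib.parse import urlencode


def build_create_params(name: str, mode: str, tables: list) -> dict:
    """Translate a form choice into clickhouse-backup create parameters."""
    params = {"name": name}
    if mode == "all":
        return params
    if mode == "selected" and tables:
        params["table"] = ",".join(tables)
        return params
    # Schema-only, or no tables selected: made explicit instead of implicit.
    params["schema"] = "true"
    return params


def create_query(name: str, mode: str, tables: list) -> str:
    """Build the query string for POST /backup/create."""
    return urlencode(build_create_params(name, mode, tables))
```

Routing the empty selection into the schema-only branch on purpose documents the historical behavior instead of leaving it as a side effect.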

A download starts with a POST to backup_download(). Django either returns an existing archive, reports an active worker, or starts a daemon thread. The browser polls every two seconds until the archive is ready, canceled, or failed, then performs the final GET for the FileResponse.

Restore from an existing backup and restore from an uploaded file converge on the same worker:

  1. backup_restore() or backup_extract() validates the password and queues _run_atomic_restore_job().
  2. _queue_restore_job() writes the initial status JSON and starts a daemon thread.
  3. Uploaded archives first pass through _prepare_backup_from_uploaded_archive(), which validates and extracts into a unique staging directory. By design, an uploaded archive replaces a same-named local backup directory and becomes the source of truth.
  4. The worker restores into mapped staging databases and polls fetch_actions() every two seconds, with a two-hour deadline.
  5. A cancel request creates a file through backup_restore_cancel(). Cancellation is cooperative: while clickhouse-backup is running, the worker waits for a safe stopping point, then drops staging data without changing the previous production tables.
  6. On success, _perform_atomic_cutover() exchanges or renames tables, reports table-level progress, cleans staging databases, and marks the job complete.
  7. backup_restore_progress() exposes the JSON state to either the backup-list page or upload page.
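
Steps 4 and 5 above can be sketched as a polling loop with a deadline and a cooperative cancel flag. This is an illustration under assumptions: wait_for_restore() is a hypothetical name, actions are assumed to be dictionaries with "command" and "status" fields listed oldest first, and matching by backup name is the same loose correlation the article describes.

```python
import os
import time
from typing import Callable


def wait_for_restore(
    fetch_actions: Callable[[], list],
    backup_name: str,
    cancel_path: str,
    interval: float = 2.0,
    deadline: float = 2 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Return 'success', 'error', 'canceled', or 'timeout'."""
    started = clock()
    cancel_seen = False
    while clock() - started < deadline:
        # Cooperative cancel: remember the request, but let the external
        # restore reach a terminal state before acting on it.
        cancel_seen = cancel_seen or os.path.exists(cancel_path)
        matching = [
            action for action in fetch_actions()
            if str(action.get("command", "")).startswith("restore")
            and backup_name in str(action.get("command", ""))
        ]
        if matching:
            status = matching[-1].get("status")
            if status in ("success", "error"):
                return "canceled" if cancel_seen else status
        sleep(interval)
    return "timeout"
```

Every outcome other than 'success' leads to the same next step in the caller: drop the staging databases and leave production untouched.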

The visible percentages are phase-based rather than byte-perfect: archive validation occupies 5-25%, extraction 25-60%, staging restore starts at 70%, waiting for clickhouse-backup advances from 85% to 99%, and cutover uses 90-99%. That compromise gives users useful feedback even though the external API does not provide detailed restore progress.
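
One way to express phase-based progress is a lookup of percentage ranges plus linear interpolation inside the current phase. The ranges below are the ones quoted above; the table and function names are hypothetical.

```python
# Percentage window for each phase that reports incremental progress.
PHASE_RANGES = {
    "validating_archive": (5, 25),
    "extracting_archive": (25, 60),
    "staging_restore_in_progress": (85, 99),
    "cutover": (90, 99),
}


def phase_percent(phase: str, done: int, total: int) -> int:
    """Map 'done out of total' units onto the phase's percentage window."""
    low, high = PHASE_RANGES[phase]
    if total <= 0:
        return low
    done = min(max(done, 0), total)  # clamp so the result stays in range
    return low + (high - low) * done // total
```

For example, phase_percent("extracting_archive", 5, 10) lands halfway through the 25-60 window, and three of three tables cut over reports the top of the cutover window.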

Takeaways

The main lesson was that a backup button is easy; trustworthy lifecycle management is the feature. The progression from tar.extractall() and direct restore to validated extraction, staging databases, atomic table swaps, and reverse rollback was the most important improvement.

I also learned to make compromises explicit. The current implementation uses in-process daemon threads and file-backed JSON instead of Celery or another durable queue. If Django restarts, an active worker disappears and its status may remain stale until cleanup. Download generation has a file lock, but restore still relies on an operating rule of one restore at a time; there is no global cross-process lock. Storage is local disk only, uploaded archives are not checksummed, and there is no ClickHouse backup scheduler or retention policy.

The next improvements are clear:

  • Move workers and progress into a durable task queue.
  • Add a distributed restore lock and startup reconciliation for interrupted jobs.
  • Add object storage, checksums, retention, and scheduled backups.
  • Add integration tests for cancellation, cutover failure, rollback, and real clickhouse-backup responses.
  • Split the large backup.views module into archive, upload, restore, and status services.
  • Improve action correlation beyond matching backup names and timestamps.
  • Remove unused restore constants and finish aligning every bilingual phase label with the backend state machine.

For a portfolio project, I think the honest limitations are part of the result. The implementation is not pretending to be a distributed backup platform. It is a practical, safer orchestration layer for one appliance-style deployment, with a clear path to becoming more durable.

Footnotes

  1. The deployment pins its mirrored clickhouse-backup image to 2.6.25 and ClickHouse to 23.9.
  2. Local backups are stored under /var/lib/clickhouse/backup; archives under /var/lib/clickhouse/archive.
  3. Upload and restore status files are retained for 12 hours; download work artifacts become stale after 10 hours.
  4. API requests use a 5-second connection timeout and a 30-second response timeout. Restore action polling has a separate two-hour ceiling.
  5. Existing-backup restore requires the user’s password, as do restore-from-file and deletion.
  6. Unit tests currently cover internal-artifact filtering, create/restore parameter construction, identifier sanitization, path-traversal rejection, and nested archive layouts. The atomic cutover path still needs broader integration coverage.
  7. The full component, sequence, and state diagrams are documented in docs/clickhouse-backup-restore-architecture.md.