Advanced

Deduplication

Keywords

deduplication, fingerprint, SHA-256, content-addressed storage, fixed-size chunking, variable-size chunking, Rabin fingerprint, inline dedup, offline dedup, reference counting

Prerequisites

Data Compression

Progress

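Having seen how data compression removes redundancy within a piece of data, we now look at how to eliminate duplication across data. Deduplication uses cryptographic hashing (such as SHA-256) to compute a fingerprint for each data chunk and stores chunks with identical content only once, which yields content-addressed storage. In class we will compare fixed-size chunking with variable-size chunking (based on the Rabin fingerprint) in terms of dedup ratio and computational cost, and discuss when inline dedup (deduplicating at write time) and offline dedup (background batch processing) are appropriate. Managing reference counting is also a practical challenge, because it directly determines when data can be safely deleted.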
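To make the fingerprint, content-addressed storage, and reference counting ideas concrete, here is a minimal in-memory sketch. The class name `DedupStore`, its methods, and the tiny chunk size are all illustrative assumptions and do not come from any particular system; a real store would persist chunks and metadata and would have to handle concurrency.

```python
import hashlib


class DedupStore:
    """Toy content-addressed chunk store (hypothetical, in-memory only)."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}    # fingerprint -> chunk bytes, stored once
        self.refcount = {}  # fingerprint -> number of recipe entries using it

    def put(self, data):
        """Store data and return its recipe: the ordered list of fingerprints."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):  # fixed-size chunking
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()      # fingerprint of the chunk
            if fp not in self.chunks:                   # new content: store it
                self.chunks[fp] = chunk
            self.refcount[fp] = self.refcount.get(fp, 0) + 1
            recipe.append(fp)
        return recipe

    def get(self, recipe):
        """Rebuild the original bytes from a recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)

    def delete(self, recipe):
        """Drop one object's references; free chunks nobody uses any more."""
        for fp in recipe:
            self.refcount[fp] -= 1
            if self.refcount[fp] == 0:
                del self.refcount[fp]
                del self.chunks[fp]


store = DedupStore(chunk_size=4)
r1 = store.put(b"AAAABBBBAAAA")   # chunks: AAAA, BBBB, AAAA
r2 = store.put(b"BBBBCCCC")       # chunks: BBBB, CCCC
unique_before = len(store.chunks)  # AAAA, BBBB, CCCC are each stored once
store.delete(r1)                   # AAAA is freed; BBBB is still used by r2
unique_after = len(store.chunks)
```

The delete path shows why reference counting matters: the `BBBB` chunk must survive the deletion of the first object because the second object's recipe still points to it.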

Key Concepts

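I understand how cryptographic hashing (such as SHA-256) computes a fingerprint for each data chunk in order to identify chunks with identical content

I understand the concept of content-addressed storage: the hash of the data's content serves as its address, so identical content is stored only once

I understand how fixed-size chunking and variable-size chunking (based on the Rabin fingerprint) differ in dedup ratio and computational cost

I understand the use cases and performance impact of inline dedup (deduplicating at write time) and offline dedup (background batch processing)

I understand the role of reference counting in deduplication and how it determines when data can be safely deleted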
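The trade-off between fixed-size and variable-size chunking can be explored with a small experiment. The sketch below is only an illustration under stated assumptions: it uses a simple polynomial rolling hash as a stand-in for a true Rabin fingerprint (which works with polynomials over GF(2)), and the constants `WINDOW`, `BASE`, `MOD`, `MASK` and all function names are arbitrary choices made for this example. It inserts one byte at the front of a random buffer and counts how many chunks each method can still reuse.

```python
import hashlib
import random

WINDOW = 16                       # bytes covered by the rolling hash
BASE = 257
MOD = (1 << 31) - 1
MASK = (1 << 6) - 1               # cut when the low 6 bits are zero
POW = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window


def fixed_chunks(data, size=64):
    """Cut every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data, min_size=16, max_size=256):
    """Content-defined chunking: cut where the rolling hash matches MASK."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i - start >= WINDOW:   # window is full: drop its oldest byte
            h = (h - data[i - WINDOW] * POW) % MOD
        h = (h * BASE + byte) % MOD
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):         # trailing partial chunk
        chunks.append(data[start:])
    return chunks


def shared_chunks(chunker, old, new):
    """Count chunks of `new` whose fingerprint already exists for `old`."""
    known = {hashlib.sha256(c).digest() for c in chunker(old)}
    return sum(hashlib.sha256(c).digest() in known for c in chunker(new))


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"!" + original          # a one-byte insertion at the front

fixed_shared = shared_chunks(fixed_chunks, original, edited)
cdc_shared = shared_chunks(cdc_chunks, original, edited)
print("fixed-size chunks reused:", fixed_shared)
print("content-defined chunks reused:", cdc_shared)
```

With fixed-size chunking the insertion shifts every later boundary, so the chunks of the edited buffer no longer match the stored ones. Content-defined boundaries depend only on the bytes inside the window, so they tend to realign shortly after the edit. The price is the extra per-byte hash computation, which is the computational cost side of the comparison.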
