← Back to Table of Contents

Hashing and Data Integrity

We said integrity means detecting whether data has been tampered with. So how do you check whether a message arrived exactly as it was sent?

The Problem Without Hashing

Imagine you download a file from the internet. How do you know the file wasn’t corrupted during transfer? Or worse, how do you know someone didn’t swap it with a malicious version?

You could compare the file byte by byte with the original. But you don’t have the original. That’s the whole point: you’re downloading it because you don’t have it yet.

You need a way to create a short “fingerprint” of the data that you can compare. If the fingerprint matches, the data is intact. If it doesn’t, something changed.

What Is a Hash?

A hash function takes any input, no matter how large, and produces a fixed-size output. Think of it as a fingerprint machine. You feed in a document, and it spits out a fingerprint that is, for all practical purposes, unique to that document.

The input can be anything: a single character, a paragraph, an entire movie file. The output is always the same size. For SHA-256, the output is always 256 bits (64 hexadecimal characters).
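
The fixed-size property is easy to check for yourself. Here is a minimal sketch using Python's standard hashlib module; the helper name sha256_hex and the sample inputs are made up for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative inputs of very different sizes:
# one byte, a short sentence, and one megabyte of zero bytes.
inputs = [
    b"a",
    b"The quick brown fox jumps over the lazy dog",
    bytes(1_000_000),
]

for data in inputs:
    digest = sha256_hex(data)
    # Each digest is 64 hex characters (256 bits), whatever the input size.
    print(f"{len(data):>9} bytes in -> {len(digest)} hex characters out")
```

Whatever you feed in, the digest length stays at 64 hexadecimal characters.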

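Here’s what makes a good hash function:

  1. Deterministic: the same input always produces the same output.
  2. Fixed-size output: the output length doesn’t depend on the input length.
  3. One-way: given a hash, there’s no practical way to recover the input.
  4. Collision-resistant: there’s no practical way to find two different inputs with the same hash.
  5. Sensitive to change: a tiny change in the input produces a completely different output.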

SHA-256 in Action

SHA-256 is one of the most widely used hash functions today. Here’s what it looks like:

Input:  "hello"
Output: 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

Input:  "hellp"
Output: fdd7585e08c4e2afd71dcabdb4636c89d557a3f42db9e2040c8bbd1708aa4ce7

Completely different outputs from inputs that differ by one character. There’s no practical way to work backward from a hash to the input, short of guessing candidate inputs and hashing each one. And there’s no known practical way to craft a different input that produces the same hash.
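
To put a number on "completely different", you can count how many of the 256 output bits change between the two inputs. This is a small sketch with Python's hashlib; the function name differing_bits is invented for this example.

```python
import hashlib


def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the SHA-256 digests of a and b differ."""
    digest_a = int.from_bytes(hashlib.sha256(a).digest(), "big")
    digest_b = int.from_bytes(hashlib.sha256(b).digest(), "big")
    # XOR leaves a 1 bit wherever the two digests disagree.
    return bin(digest_a ^ digest_b).count("1")


changed = differing_bits(b"hello", b"hellp")
print(f"{changed} of 256 output bits differ")
```

For a well-behaved hash function you would expect roughly half of the bits to flip, even though the inputs differ by a single character.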

How Hashing Solves Integrity

Here’s the basic idea. The sender computes a hash of the message and sends both the message and the hash. The receiver computes the hash of the received message and compares it to the hash that was sent. If they match, the message wasn’t modified.

sequenceDiagram
    participant S as Sender
    participant R as Receiver

    S->>S: Compute hash of message
    S->>R: Message + Hash
    R->>R: Compute hash of received message
    R->>R: Compare computed hash with received hash
    Note over R: Match = data intact
    Note over R: Mismatch = data was modified

This is how checksum verification for software downloads works. The website publishes the SHA-256 hash of the file. You download the file, compute the hash yourself, and compare. If they match, you got the right file.
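
The download check can be sketched in a few lines of Python. The function names, the 64 KiB chunk size, and the file name in the usage note are assumptions for illustration; the published hash would come from the download page.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read until f.read() returns an empty bytes object (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare the file's SHA-256 with the hash copied from the website."""
    # Normalize case and stray whitespace from copy and paste.
    return sha256_of_file(path) == published_hex.strip().lower()
```

You would call something like matches_published_hash("installer.bin", hash_from_website) and refuse to run the file if it returns False.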

The Problem with Plain Hashing

There’s a catch. If an attacker can modify the message, they can also modify the hash. They change the message, compute a new hash for the modified message, and send both. The receiver computes the hash, it matches, and they have no idea the message was tampered with.

Plain hashing only works if the hash is delivered through a separate, trusted channel. For software downloads, the hash is on the website (hopefully over HTTPS). But for data flowing over a network connection, we need something better.
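
The attack is simple enough to simulate. In this sketch the messages and the helper names package and receiver_accepts are invented for illustration; the point is only that anyone can compute a plain hash.

```python
import hashlib
from typing import Tuple


def package(message: bytes) -> Tuple[bytes, str]:
    """What a sender (or an attacker) puts on the wire: message plus its hash."""
    return message, hashlib.sha256(message).hexdigest()


def receiver_accepts(message: bytes, received_hash: str) -> bool:
    """The receiver's check: recompute the hash and compare."""
    return hashlib.sha256(message).hexdigest() == received_hash


original_message, original_hash = package(b"pay Alice 10")

# An attacker who changes only the message is caught...
print(receiver_accepts(b"pay Mallory 1000", original_hash))  # False

# ...but an attacker who also recomputes the hash is not.
forged_message, forged_hash = package(b"pay Mallory 1000")
print(receiver_accepts(forged_message, forged_hash))  # True
```

Nothing in the second check tells the receiver that the message and hash both came from the wrong party.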

HMAC: Hashing with a Key

HMAC (Hash-based Message Authentication Code) solves this. It’s a hash that requires a secret key. Only someone who knows the key can compute the correct HMAC.

The sender and receiver share a secret key. The sender computes HMAC(key, message) and sends the message plus the HMAC. The receiver computes HMAC(key, received message) with the same key and compares. If they match, two things are true:

  1. The message wasn’t modified (integrity)
  2. The message came from someone who knows the key (authentication of the message, not the sender’s identity)

An attacker who modifies the message can’t compute the correct HMAC because they don’t have the key. They can’t just recompute the hash.
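
Here is a minimal sketch of the same exchange with HMAC-SHA-256, using Python's standard hmac module. The key, the messages, and the helper names sign and verify are placeholders for illustration; a real key would be random and kept secret.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Compute HMAC-SHA-256 of message under key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the HMAC and compare it in constant time."""
    return hmac.compare_digest(sign(key, message), tag)


shared_key = b"example-shared-secret"  # placeholder key for this sketch
message = b"pay Alice 10"
tag = sign(shared_key, message)

print(verify(shared_key, message, tag))  # True: untouched message

# The attacker changes the message but cannot produce a valid tag,
# because they have to guess at the key.
tampered = b"pay Mallory 1000"
attacker_tag = sign(b"attacker-guess", tampered)
print(verify(shared_key, tampered, tag))  # False: old tag, new message
print(verify(shared_key, tampered, attacker_tag))  # False: wrong key
```

hmac.compare_digest is used instead of == so that the comparison time doesn't leak how many leading bytes of a forged tag were correct.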

Where This Shows Up in TLS

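In TLS, once the handshake has established shared secret keys, every record is protected against tampering. Older cipher suites attach a MAC to each record, while modern ones (including every TLS 1.3 cipher suite) use AEAD encryption, which bundles encryption and integrity together. Either way, encrypted data can’t be tampered with in transit without the receiver noticing.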

We’ll see exactly how this works when we get to cipher suites and the handshake. For now, the key takeaway is: hashing gives us integrity, and HMAC gives us integrity that an attacker can’t forge.

Hash Functions You’ll See

Algorithm   Output Size   Status
MD5         128 bits      Broken. Do not use.
SHA-1       160 bits      Broken. Being phased out.
SHA-256     256 bits      Current standard. Used everywhere in TLS.
SHA-384     384 bits      Used in some TLS cipher suites.
SHA-512     512 bits      Available but less common in TLS.

MD5 and SHA-1 are “broken” because researchers found practical ways to create collisions: two different inputs that produce the same hash. This means an attacker could craft a harmless file and a malicious file that share the same hash, so a matching hash no longer tells you which one you received. SHA-256 and above have no known practical attacks.
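
The output sizes in the table can be confirmed with Python's hashlib. This is only a sketch: it assumes your Python build still exposes the legacy algorithms (some locked-down builds disable MD5), and it checks sizes, not security.

```python
import hashlib

# Algorithm names as hashlib spells them.
algorithms = ["md5", "sha1", "sha256", "sha384", "sha512"]

# digest_size is reported in bytes, so multiply by 8 to get bits.
output_bits = {name: hashlib.new(name).digest_size * 8 for name in algorithms}

for name, bits in output_bits.items():
    print(f"{name:<7} {bits} bits")
```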


Next: Symmetric Encryption

← Previous ChapterNext Chapter →