Hashing and Data Integrity

We said integrity means detecting if data was tampered with. So how do you check whether a message arrived exactly as it was sent?

The Problem Without Hashing

Imagine you download a file from the internet. How do you know the file wasn’t corrupted during transfer? Or worse, how do you know someone didn’t swap it with a malicious version?

You could compare the file byte by byte with the original. But you don’t have the original. That’s the whole point: you’re downloading it because you don’t have it yet.

You need a way to create a short “fingerprint” of the data that you can compare. If the fingerprint matches, the data is intact. If it doesn’t, something changed.

What Is a Hash?

A hash function takes any input, no matter how large, and produces a fixed-size output. Think of it as a fingerprint machine. You feed in a document, and it spits out a short fingerprint that is, for all practical purposes, unique to that document.

The input can be anything: a single character, a paragraph, an entire movie file. The output is always the same size. For SHA-256, the output is always 256 bits (64 hexadecimal characters).
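
The fixed-size property is easy to check with Python’s standard hashlib module. This is a minimal sketch: the helper name digest_hex and the sample inputs are illustrative assumptions, not part of the original text.

```python
import hashlib


def digest_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative inputs of very different sizes (assumed for this sketch).
samples = [b"a", b"a short paragraph of text", b"\x00" * 1_000_000]

for data in samples:
    digest = digest_hex(data)
    # The digest is always 32 bytes, which prints as 64 hex characters.
    print(f"{len(data):>8} bytes in -> {len(digest)} hex characters out")
```

Whether the input is one byte or a million, the printed digest length stays at 64 hexadecimal characters.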

Here’s what makes a good hash function:

One-way. Given the hash output, you can’t figure out the input. You can’t reverse-engineer the document from its fingerprint.
Deterministic. The same input always produces the same output. Hash “hello” a million times, you get the same hash every time.
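Avalanche effect. Change even one bit of the input, and the output changes completely: on average, about half of the output bits flip. “hello” and “hellp” differ by a single character, yet they produce wildly different hashes. There’s no pattern, no way to predict how the output changes.
Collision resistant. It’s practically impossible to find two different inputs that produce the same hash. Collisions must exist, because there are more possible inputs than outputs, but for a good hash function nobody can find one in practice. In that sense, every document has its own fingerprint.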

SHA-256 in Action

SHA-256 is one of the most widely used hash functions today. Here’s what it looks like:

Input:  "hello"
Output: 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

Input:  "hellp"
Output: fdd7585e08c4e2afd71dcabdb4636c89d557a3f42db9e2040c8bbd1708aa4ce7


Completely different outputs from inputs that differ by one
character. There’s no practical way to look at the hash and work out
what the input was. And there’s no known way to craft a different input
that produces the same hash.
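
The hashes above can be recomputed with Python’s hashlib, and the avalanche effect can be put into numbers by counting how many of the 256 output bits differ. This is a rough sketch; the helper names sha256_hex and differing_bits are made up for illustration.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hash a string (encoded as UTF-8) and return the hex digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


h1 = sha256_hex("hello")
h2 = sha256_hex("hellp")

print(h1)
print(h2)
# Deterministic: hashing the same input again gives the same digest.
print(sha256_hex("hello") == h1)
# Avalanche: a one-character change flips a large share of the 256 bits.
print(differing_bits(h1, h2), "of 256 bits differ")
```

For a well-behaved hash function, the bit count should land somewhere near 128, that is, about half of the output.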

How Hashing Solves Integrity

Here’s the basic idea. The sender computes a hash of the message and
sends both the message and the hash. The receiver computes the hash of
the received message and compares it to the hash that was sent. If they
match, the message wasn’t modified.

sequenceDiagram
    participant S as Sender
    participant R as Receiver

    S->>S: Compute hash of message
    S->>R: Message + Hash
    R->>R: Compute hash of received message
    R->>R: Compare computed hash with received hash
    Note over R: Match = data intact
    Note over R: Mismatch = data was modified

This is how software downloads work. The website publishes the
SHA-256 hash of the file. You download the file, compute the hash
yourself, and compare. If they match, you got the right file.
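
The download check can be sketched in a few lines of Python. The function names, the chunk size, and the idea of passing the published hash in as a string are assumptions made for this example; a real site may publish its hashes in a different format.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 64 * 1024) -> str:
    """Hash a file in chunks so a large download need not fit in memory."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare the file's digest with the hash copied from the website."""
    actual = sha256_of_file(path)
    # Normalize whitespace and case before comparing the two hex strings.
    return hmac.compare_digest(actual, published_hex.strip().lower())
```

Calling matches_published_hash with the path of the downloaded file and the hash string from the website returns True only when the two digests agree.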

The Problem with Plain Hashing

There’s a catch. If an attacker can modify the message, they can also
modify the hash. They change the message, compute a new hash for the
modified message, and send both. The receiver computes the hash, it
matches, and they have no idea the message was tampered with.
Plain hashing only works if the hash is delivered through a separate,
trusted channel. For software downloads, the hash is on the website
(hopefully over HTTPS). But for data flowing over a network connection,
we need something better.
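
The attack is simple enough to simulate. In this sketch the messages and the function names plain_hash and receiver_accepts are invented for illustration; the point is only that the receiver’s check passes for a forged message when the attacker replaces the hash as well.

```python
import hashlib


def plain_hash(message: bytes) -> str:
    """Unkeyed SHA-256 digest: anyone can compute it."""
    return hashlib.sha256(message).hexdigest()


def receiver_accepts(message: bytes, received_hash: str) -> bool:
    """The receiver's check: recompute the hash and compare."""
    return plain_hash(message) == received_hash


original = b"pay alice 10"
sent = (original, plain_hash(original))

# An attacker on the network swaps the message AND recomputes the hash.
forged = b"pay mallory 1000"
tampered = (forged, plain_hash(forged))

print(receiver_accepts(*sent))            # True: the honest message passes
print(receiver_accepts(*tampered))        # True: the forgery passes too
print(receiver_accepts(forged, sent[1]))  # False: caught only if the hash is left alone
```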

HMAC: Hashing with a Key

HMAC (Hash-based Message Authentication Code) solves this. It’s a
hash that requires a secret key. Only someone who knows the key can
compute the correct HMAC.
The sender and receiver share a secret key. The sender computes
HMAC(key, message) and sends the message plus the HMAC. The receiver
computes HMAC(key, received message) with the same key and compares. If
they match, two things are true:

The message wasn’t modified (integrity)
The message came from someone who knows the key (authentication of
the message, not the sender’s identity)

An attacker who modifies the message can’t compute the correct HMAC
because they don’t have the key. They can’t just recompute the hash.
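
Python’s standard hmac module implements this directly. The following is a minimal sketch using HMAC-SHA256: the messages, the randomly generated key, and the helper names make_tag and verify are assumptions for the example, and how the two sides come to share the key is outside its scope.

```python
import hashlib
import hmac
import secrets


def make_tag(key: bytes, message: bytes) -> bytes:
    """Compute HMAC-SHA256 of message under key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(key, message), tag)


# Hypothetical key known only to sender and receiver.
shared_key = secrets.token_bytes(32)

message = b"pay alice 10"
tag = make_tag(shared_key, message)

# The attacker changes the message but does not know shared_key,
# so the best they can do is reuse the old tag or guess a key.
forged = b"pay mallory 1000"
attacker_tag = make_tag(b"attacker-guess", forged)

print(verify(shared_key, message, tag))          # True: untouched message
print(verify(shared_key, forged, tag))           # False: old tag, new message
print(verify(shared_key, forged, attacker_tag))  # False: tag made with the wrong key
```

Using hmac.compare_digest rather than == is a common precaution so that the comparison time does not reveal how many leading bytes of a guessed tag were correct.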

Where This Shows Up in TLS

In TLS, after the handshake establishes a shared secret key, every
message includes a MAC (or uses AEAD encryption, which bundles
encryption and integrity together). This ensures that encrypted data
can’t be tampered with in transit.
We’ll see exactly how this works when we get to cipher suites and the
handshake. For now, the key takeaway is: hashing gives us integrity, and
HMAC gives us integrity that an attacker can’t forge.

Hash Functions You’ll See

Algorithm	Output Size	Status
MD5	128 bits	Broken. Do not use.
SHA-1	160 bits	Broken. Being phased out.
SHA-256	256 bits	Current standard. Used everywhere in TLS.
SHA-384	384 bits	Used in some TLS cipher suites.
SHA-512	512 bits	Available but less common in TLS.

MD5 and SHA-1 are “broken” because researchers found practical ways to
create collisions: two different inputs that produce the same hash. This
means an attacker who gets to craft both files could prepare a
harmless-looking file and a malicious one that share a hash, so a check
against that hash can’t tell them apart. SHA-256 and above have no known
practical collision attacks.
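
The output sizes in the table can be read straight from Python’s hashlib. This is a small sketch; it assumes a standard Python build in which all five algorithms are enabled (some locked-down builds disable MD5).

```python
import hashlib

# Algorithm names as hashlib spells them.
names = ["md5", "sha1", "sha256", "sha384", "sha512"]

# digest_size is reported in bytes; multiply by 8 to get bits.
sizes = {name: hashlib.new(name).digest_size * 8 for name in names}

for name, bits in sizes.items():
    print(f"{name:<8}{bits} bits")
```

MD5 and SHA-1 appear here only to show their sizes; as the table says, neither should be relied on for integrity.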


