Hashing Algorithms

Understand cryptographic hash functions, their properties, and proper use cases including integrity verification and password storage.

A cryptographic hash function takes input of any size and produces a fixed-size output (the hash or digest). Unlike encryption, hashing is one-way: there is no key that reverses it, and recovering the original input from a digest should be computationally infeasible. This makes hashing essential for integrity verification, password storage, and digital signatures.
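
As a small illustration of the fixed-size property, the sketch below hashes inputs of very different lengths with SHA-256 and prints the digest length each time. The sample inputs are arbitrary values chosen for this example.

```python
import hashlib

# Inputs of very different sizes (illustrative values only)
inputs = [b"", b"a", b"x" * 1_000_000]

for data in inputs:
    digest = hashlib.sha256(data).hexdigest()
    # SHA-256 always yields 256 bits = 32 bytes = 64 hex characters
    print(f"{len(data):>9} bytes in -> {len(digest)} hex chars out")

# Determinism: hashing the same bytes twice gives the same digest
assert hashlib.sha256(b"a").hexdigest() == hashlib.sha256(b"a").hexdigest()
```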

Properties of Cryptographic Hash Functions

A secure hash function must have these properties:

| Property | Description | Why It Matters |
|---|---|---|
| Deterministic | Same input always produces same output | Enables verification |
| Fast to compute | Hash any size input quickly | Practical for large files |
| Pre-image resistant | Infeasible to find an input that produces a given hash | Protects original data |
| Second pre-image resistant | Infeasible to find a different input with the same hash as a given input | Prevents forgery |
| Collision resistant | Infeasible to find any two inputs with the same hash | Lets digests act as practical unique identifiers |
| Avalanche effect | Small input change produces a completely different hash | Hides patterns |

The Avalanche Effect

import hashlib

# Tiny change in input = completely different hash
hash1 = hashlib.sha256(b"Hello World").hexdigest()
hash2 = hashlib.sha256(b"Hello World!").hexdigest()  # Added '!'

print(f"'Hello World'  -> {hash1}")
print(f"'Hello World!' -> {hash2}")

# Output:
# 'Hello World'  -> a591a6d40bf420404a011733cfb7b190d62c65bf0bcda32b57b277d9ad9f146e
# 'Hello World!' -> 7f83b1657ff1fc53b92dc18148a1d65dfc2d4b1fa3d677284addd200126d9069
# (Completely different despite one character change)
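
To put a number on "completely different", one option is to count how many bits differ between the two digests. This is a minimal sketch; `differing_bits` is a helper name invented for this example. For a well-behaved 256-bit hash, roughly half of the bits are expected to flip.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

d1 = hashlib.sha256(b"Hello World").digest()
d2 = hashlib.sha256(b"Hello World!").digest()

changed = differing_bits(d1, d2)
print(f"{changed} of {len(d1) * 8} bits differ")
```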

Common Hash Algorithms

| Algorithm | Output Size | Status | Use Case |
|---|---|---|---|
| SHA-256 | 256 bits | ✅ Recommended | General purpose, checksums, blockchain |
| SHA-384 | 384 bits | ✅ Secure | Higher security requirements |
| SHA-512 | 512 bits | ✅ Secure | Large data, 64-bit optimized |
| SHA-3 | Variable | ✅ Recommended | NIST standard, different design than SHA-2 |
| BLAKE2 | Variable | ✅ Recommended | Faster than SHA-2, very secure |
| BLAKE3 | Variable | ✅ Recommended | Fastest, parallelizable |
| SHA-1 | 160 bits | ❌ Broken | Never use for security |
| MD5 | 128 bits | ❌ Broken | Never use for security |
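
Most of the algorithms in the table are available through Python's standard `hashlib` module. The sketch below hashes one arbitrary sample payload with several of them and prints the output sizes. BLAKE3 is not part of `hashlib` and would need a third-party package, so it is left out here.

```python
import hashlib

data = b"example payload"  # illustrative input

digests = {
    "sha256": hashlib.sha256(data).digest(),
    "sha384": hashlib.sha384(data).digest(),
    "sha512": hashlib.sha512(data).digest(),
    "sha3_256": hashlib.sha3_256(data).digest(),
    # BLAKE2b lets the caller choose the digest size (1 to 64 bytes)
    "blake2b-32": hashlib.blake2b(data, digest_size=32).digest(),
}

for name, digest in digests.items():
    print(f"{name:<10} {len(digest) * 8:>3} bits  {digest.hex()[:16]}...")
```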

Why MD5 and SHA-1 Are Broken

Both MD5 and SHA-1 have practical collision attacks:

  • MD5: Collisions can be generated in seconds on a laptop
  • SHA-1: SHAttered attack (2017) created two different PDFs with the same SHA-1 hash

# ❌ NEVER use for security
md5_hash = hashlib.md5(data).hexdigest()    # Broken
sha1_hash = hashlib.sha1(data).hexdigest()  # Broken

# ✅ Use SHA-256 or better
sha256_hash = hashlib.sha256(data).hexdigest()  # Secure

Note: MD5 is still acceptable for non-security purposes like checksums for data corruption detection (not malicious tampering).
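
When MD5 really is used only to detect accidental corruption, Python 3.9 and later let the call say so explicitly. This is a sketch of that idea; `corruption_checksum` is a hypothetical helper name.

```python
import hashlib

def corruption_checksum(data: bytes) -> str:
    """MD5 checksum for detecting accidental corruption only (hypothetical helper)."""
    # usedforsecurity=False (Python 3.9+) documents the intent and lets the call
    # succeed on builds that restrict MD5 for security use.
    return hashlib.md5(data, usedforsecurity=False).hexdigest()

print(corruption_checksum(b"hello"))
# 5d41402abc4b2a76b9719d911017c592
```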

Use Case 1: File Integrity Verification

Hashes verify that files haven't been modified:

import hashlib

def calculate_file_hash(filepath: str, algorithm: str = 'sha256') -> str:
    """Calculate hash of a file efficiently."""
    hash_func = hashlib.new(algorithm)
    
    with open(filepath, 'rb') as f:
        # Read in chunks for large files
        for chunk in iter(lambda: f.read(8192), b''):
            hash_func.update(chunk)
    
    return hash_func.hexdigest()

def verify_file_integrity(filepath: str, expected_hash: str, 
                          algorithm: str = 'sha256') -> bool:
    """Verify file matches expected hash."""
    actual_hash = calculate_file_hash(filepath, algorithm)
    # hexdigest() is lowercase; normalize the published value before comparing
    return actual_hash == expected_hash.strip().lower()

# Example: Verifying a downloaded binary
# Placeholder value (this is the SHA-256 of empty input); use the publisher's hash
expected = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
if verify_file_integrity("download.tar.gz", expected):
    print("File integrity verified")
else:
    print("WARNING: File may be corrupted or tampered!")

Command-Line Verification

# Generate hash
sha256sum myfile.tar.gz > myfile.sha256

# Verify hash
sha256sum -c myfile.sha256
# myfile.tar.gz: OK

# Verify downloaded software
echo "expected_hash_here  terraform.zip" | sha256sum -c -
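
Where `sha256sum` is not available, the same check can be approximated in Python. The sketch below assumes a checksum file in the `<hex digest>  <filename>` layout that `sha256sum` writes, with file names resolved relative to the checksum file. `sha256_of` and `check_sums` are names invented for this example.

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def check_sums(sums_file: Path) -> dict:
    """Verify every '<hex digest>  <filename>' line; map filename -> passed."""
    results = {}
    for line in sums_file.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        target = sums_file.parent / name
        # A missing file counts as a failure rather than being skipped
        results[name] = target.is_file() and sha256_of(target) == expected.lower()
    return results
```

Unlike `sha256sum -c --ignore-missing`, this sketch reports missing files as failures, which is the more conservative choice.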

Use Case 2: Password Storage

Never store passwords in plaintext. Regular hashing isn't enough either: you need a password hashing function that is deliberately slow and, ideally, memory-hard, so that every guess is expensive for an attacker.

Why Regular Hashes Are Bad for Passwords

# ❌ BAD: Fast hash allows rapid brute force
import hashlib
password_hash = hashlib.sha256(password.encode()).hexdigest()
# Attacker can try billions of guesses per second!

# ❌ ALSO BAD: Unsalted hash vulnerable to rainbow tables
# Same password = same hash for all users
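
The salting problem can be seen directly. The sketch below is for demonstration only: it shows that a per-user random salt makes identical passwords hash differently, but salted SHA-256 is still far too fast for real password storage. The function names and the sample password are made up for this example.

```python
import hashlib
import os

def unsalted(password: str) -> str:
    return hashlib.sha256(password.encode()).hexdigest()

def salted(password: str, salt: bytes) -> str:
    # Demonstration only: a salt fixes the "same hash" problem,
    # but SHA-256 remains too fast for password storage.
    return hashlib.sha256(salt + password.encode()).hexdigest()

alice = unsalted("hunter2")
bob = unsalted("hunter2")
print(alice == bob)  # True: identical passwords are visible as identical hashes

alice_salted = salted("hunter2", os.urandom(16))
bob_salted = salted("hunter2", os.urandom(16))
print(alice_salted == bob_salted)  # False, because each user has a different random salt
```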

Password Hashing Functions

| Algorithm | Status | Notes |
|---|---|---|
| Argon2id | ✅ Best | PHC winner, memory-hard, recommended |
| bcrypt | ✅ Good | Battle-tested, 72-byte limit |
| scrypt | ✅ Good | Memory-hard, used by some cryptocurrencies |
| PBKDF2 | ✅ Acceptable | Widely supported, needs high iterations |

Argon2id (Recommended)

from argon2 import PasswordHasher
from argon2.exceptions import VerifyMismatchError

# Initialize with explicit cost parameters
ph = PasswordHasher(
    time_cost=3,        # Number of iterations
    memory_cost=65536,  # Memory in KiB (64 MiB)
    parallelism=4,      # 4 parallel lanes (threads)
)

def hash_password(password: str) -> str:
    """Hash a password with Argon2id."""
    return ph.hash(password)

def verify_password(password: str, hashed: str) -> bool:
    """Verify a password against its stored hash."""
    try:
        # argon2-cffi takes the stored hash first, then the candidate password
        ph.verify(hashed, password)
        return True
    except VerifyMismatchError:
        return False

# Example usage
password = "user_secret_password"
hashed = hash_password(password)
print(f"Hash: {hashed}")
# $argon2id$v=19$m=65536,t=3,p=4$...(salt)...$...(hash)...

# Verification
assert verify_password("user_secret_password", hashed)
assert not verify_password("wrong_password", hashed)

bcrypt (Widely Supported)

import bcrypt

def hash_password_bcrypt(password: str) -> bytes:
    """Hash a password with bcrypt."""
    salt = bcrypt.gensalt(rounds=12)  # 2^12 iterations
    return bcrypt.hashpw(password.encode(), salt)

def verify_password_bcrypt(password: str, hashed: bytes) -> bool:
    """Verify a bcrypt password hash."""
    return bcrypt.checkpw(password.encode(), hashed)

# Example
hashed = hash_password_bcrypt("my_password")
print(f"Hash: {hashed}")
# b'$2b$12$...(salt+hash)...'

assert verify_password_bcrypt("my_password", hashed)

OWASP Password Storage Recommendations

# OWASP 2024 recommendations
recommended_algorithms:
  first_choice: argon2id
  argon2id_config:
    memory: 19456  # KiB (19 MiB) minimum; OWASP lists 46 MiB with 1 iteration as an equivalent option
    iterations: 2
    parallelism: 1
  
  second_choice: scrypt
  scrypt_config:
    n: 2^17  # CPU/memory cost
    r: 8     # Block size
    p: 1     # Parallelism
  
  third_choice: bcrypt
  bcrypt_config:
    cost: 10  # Minimum (12+ recommended)
  
  legacy_acceptable: pbkdf2
  pbkdf2_config:
    iterations: 600000  # SHA-256
    # or 210000 for SHA-512
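
For environments limited to the standard library, PBKDF2 is available as `hashlib.pbkdf2_hmac`. The sketch below uses the PBKDF2-HMAC-SHA256 iteration count from the list above. The `iterations$salt$key` storage format and the function names are ad hoc choices for this example, not a standard.

```python
import hashlib
import hmac
import os

ITERATIONS = 600_000  # PBKDF2-HMAC-SHA256 figure from the list above

def pbkdf2_hash(password: str, iterations: int = ITERATIONS) -> str:
    """Return 'iterations$salt_hex$key_hex' (an ad hoc format for this sketch)."""
    salt = os.urandom(16)  # fresh random salt per password
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return f"{iterations}${salt.hex()}${key.hex()}"

def pbkdf2_verify(password: str, stored: str) -> bool:
    """Recompute the key with the stored salt and iteration count."""
    iterations, salt_hex, key_hex = stored.split("$")
    key = hashlib.pbkdf2_hmac(
        "sha256", password.encode(), bytes.fromhex(salt_hex), int(iterations)
    )
    # Constant-time comparison of the derived keys
    return hmac.compare_digest(key, bytes.fromhex(key_hex))
```

Storing the iteration count next to the hash means it can be raised later without invalidating existing entries.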

Use Case 3: Message Authentication Codes (HMAC)

HMAC combines hashing with a secret key to provide authentication and integrity:

import hmac
import hashlib

def create_hmac(message: bytes, key: bytes) -> str:
    """Create HMAC-SHA256 for a message."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify_hmac(message: bytes, key: bytes, expected_mac: str) -> bool:
    """Verify HMAC using constant-time comparison."""
    actual_mac = create_hmac(message, key)
    return hmac.compare_digest(actual_mac, expected_mac)

# Example: Webhook signature verification
secret_key = b"webhook_secret_from_provider"
payload = b'{"event": "payment.completed", "amount": 100}'

# Provider sends this signature in header
signature = create_hmac(payload, secret_key)

# You verify it
if verify_hmac(payload, secret_key, signature):
    print("Webhook is authentic")
else:
    print("WARNING: Invalid webhook signature!")
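
The reason a key is needed can be shown with a tampered message. In this sketch (the key and payloads are hypothetical), an attacker who alters the payload can always recompute a plain hash, but cannot produce a valid HMAC tag without the key, so the old tag no longer verifies.

```python
import hashlib
import hmac

key = b"shared_secret"  # hypothetical key known only to sender and receiver
original = b'{"event": "payment.completed", "amount": 100}'
tampered = b'{"event": "payment.completed", "amount": 9999}'

def sign(message: bytes) -> str:
    return hmac.new(key, message, hashlib.sha256).hexdigest()

tag = sign(original)

# A plain hash gives no authentication: anyone can recompute it for a new message
forged_plain_hash = hashlib.sha256(tampered).hexdigest()

# The receiver recomputes the tag with the key and compares in constant time
print(hmac.compare_digest(sign(original), tag))  # True
print(hmac.compare_digest(sign(tampered), tag))  # False
```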

Timing Attack Prevention

Always use constant-time comparison for security-sensitive comparisons:

# ❌ BAD: Vulnerable to timing attack
if actual_hash == expected_hash:  # String comparison leaks timing info
    return True

# ✅ GOOD: Constant-time comparison
import hmac
if hmac.compare_digest(actual_hash, expected_hash):
    return True
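
One detail worth knowing: `hmac.compare_digest` accepts two `str` arguments only when both are ASCII, and otherwise expects bytes-like objects. A small wrapper such as the hypothetical `tokens_match` below sidesteps that by encoding first. Note that a length difference between the inputs may still be observable through timing; only the contents are protected.

```python
import hmac

def tokens_match(supplied: str, stored: str) -> bool:
    """Constant-time equality for text tokens (hypothetical helper)."""
    # Encoding to bytes avoids a TypeError on non-ASCII str input
    return hmac.compare_digest(supplied.encode("utf-8"), stored.encode("utf-8"))

print(tokens_match("abc123", "abc123"))  # True
print(tokens_match("abc123", "abc124"))  # False
```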

Use Case 4: Content-Addressable Storage

Git, Docker, and many distributed systems use hashes as content identifiers:

import hashlib

def content_address(data: bytes) -> str:
    """Generate content-addressable identifier."""
    return f"sha256:{hashlib.sha256(data).hexdigest()}"

# Docker image digests
layer_data = b"... layer tarball ..."
digest = content_address(layer_data)
print(f"Layer digest: {digest}")
# sha256:abc123def456...

# If the content changes, the digest changes
# If two digests match, the content is identical for all practical purposes
# (finding a SHA-256 collision is computationally infeasible)
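
The same idea extends to a small store keyed by digest. The class below is a minimal in-memory sketch, not how Docker or Git implement storage; `ContentStore` and its methods are names invented for illustration. It shows two consequences of content addressing: identical data is stored once, and every read can be checked against its address.

```python
import hashlib

class ContentStore:
    """Minimal in-memory content-addressable store (illustrative only)."""

    def __init__(self) -> None:
        self._blobs = {}

    def put(self, data: bytes) -> str:
        address = f"sha256:{hashlib.sha256(data).hexdigest()}"
        self._blobs[address] = data  # identical content lands on the same key
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        # Re-hash on read so corruption in storage is detected
        if f"sha256:{hashlib.sha256(data).hexdigest()}" != address:
            raise ValueError("stored content does not match its address")
        return data

    def __len__(self) -> int:
        return len(self._blobs)

store = ContentStore()
first = store.put(b"layer contents")
second = store.put(b"layer contents")  # duplicate upload
print(first == second, len(store))     # True 1
```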

Git Object Hashing

# Git uses SHA-1 (transitioning to SHA-256)
echo "Hello" | git hash-object --stdin
# e965047ad7c57865823c7d992b1d046ea66edf78

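# Recompute a commit ID by hand: Git hashes a "commit <size>\0" header plus the commit body
(printf 'commit %s\0' "$(git cat-file -s HEAD)"; git cat-file commit HEAD) | sha1sum
# The result matches `git rev-parse HEAD` (in a SHA-1 repository)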
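
Git does not hash raw content on its own. For a blob it hashes a short header, `blob <size>` followed by a NUL byte, and then the content. The sketch below reproduces that for SHA-1 repositories; `git_blob_id` is a name chosen for this example. Keep in mind that `echo` appends a newline, so the shell command above hashes `Hello\n` rather than `Hello`.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob with this content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()

print(git_blob_id(b""))
# e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (Git's well-known empty blob ID)
```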

Hashing in DevOps Contexts

Container Image Digests

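# Dockerfile: Pin base image by digest (immutable)
# Comments must sit on their own lines; Dockerfiles have no inline comments
# ✅ Reproducible: a digest always refers to the same content
FROM nginx@sha256:abc123def456...
# ❌ A tag can be moved to a different image:
# FROM nginx:1.25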

# Kubernetes: Reference by digest
spec:
  containers:
    - name: app
      image: myregistry/app@sha256:abc123def456...

Artifact Checksums

# Download and verify Terraform
curl -O https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
curl -O https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_SHA256SUMS

# Verify checksum
sha256sum -c terraform_1.6.0_SHA256SUMS --ignore-missing
# terraform_1.6.0_linux_amd64.zip: OK

Cache Invalidation

import hashlib
import json

def cache_key(config: dict) -> str:
    """Generate cache key from configuration."""
    # Deterministic JSON serialization
    config_str = json.dumps(config, sort_keys=True)
    return hashlib.sha256(config_str.encode()).hexdigest()[:16]

# Same config = same cache key
config = {"version": "1.0", "features": ["a", "b"]}
key = cache_key(config)
print(f"Cache key: {key}")  # Consistent across runs
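
A quick check of what "deterministic serialization" buys here: with `sort_keys=True`, reordering dictionary keys leaves the cache key unchanged, while reordering a list changes it, because list order is part of the content. The sketch repeats `cache_key` so that it runs on its own. Truncating to 16 hex characters keeps 64 bits of the digest, which suits cache lookups but not security decisions.

```python
import hashlib
import json

def cache_key(config: dict) -> str:
    config_str = json.dumps(config, sort_keys=True)
    return hashlib.sha256(config_str.encode()).hexdigest()[:16]

a = cache_key({"version": "1.0", "features": ["a", "b"]})
b = cache_key({"features": ["a", "b"], "version": "1.0"})  # keys reordered
c = cache_key({"version": "1.0", "features": ["b", "a"]})  # list reordered

print(a == b)  # True: sort_keys=True normalizes dict key order
print(a == c)  # False: list order is part of the content
```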

Common Hashing Mistakes

1. Using MD5/SHA-1 for Security

# ❌ BAD: Broken algorithms
file_hash = hashlib.md5(data).hexdigest()
signature = hashlib.sha1(message).hexdigest()

# ✅ GOOD: SHA-256 or better
file_hash = hashlib.sha256(data).hexdigest()

2. Hashing Passwords with Fast Algorithms

# ❌ BAD: Fast hash for passwords
password_hash = hashlib.sha256(password.encode()).hexdigest()

# ✅ GOOD: Purpose-built password hasher
from argon2 import PasswordHasher
password_hash = PasswordHasher().hash(password)

3. Not Using Salt for Passwords

# ❌ BAD: Unsalted (vulnerable to rainbow tables)
password_hash = hashlib.sha256(password.encode()).hexdigest()

# ✅ GOOD: Salted (Argon2/bcrypt do this automatically)
# Each password gets unique salt, stored with hash

4. Non-Constant Time Comparison

# ❌ BAD: Timing attack vulnerability
if user_token == stored_token:
    authenticate()

# ✅ GOOD: Constant-time comparison
if hmac.compare_digest(user_token, stored_token):
    authenticate()

Summary

Key takeaways for hashing:

  • Hash functions are one-way: recovering the original input from a digest is computationally infeasible
  • Use SHA-256 or better for integrity verification, and avoid MD5 and SHA-1 wherever security matters
  • Use password-specific functions (Argon2id, bcrypt) to store passwords, never a fast hash such as plain SHA-256
  • Use HMAC for authentication: it combines a hash with a secret key
  • Compare secrets in constant time with hmac.compare_digest() to prevent timing attacks
  • Use hashes as immutable content identifiers, as Docker and Git do

In the next section, we'll explore TLS/SSL—how encryption and hashing combine to secure data in transit.
