Notes - 1.3.1 Compression, encryption and hashing

Lossy and Lossless Compression

Compression reduces file sizes to enable faster data transfer over the internet, lower bandwidth consumption, and efficient storage.

There are two main methods:

Method	What happens	Benefits	Drawbacks
Lossy	Some data is permanently discarded	Greatly reduced file sizes; suitable for media streaming where some quality loss is acceptable	Irreversible loss of data quality; not suitable for text or data requiring exact integrity
Lossless	All original data is preserved; nothing is lost	Maintains original data integrity; best for text and data that must be exact	Larger file sizes than lossy; requires high bandwidth when streaming

Recommending a Compression Type

• Choose lossless when data integrity is imperative — e.g. medical images, text documents, archival storage, brand logos
• Choose lossy when small file size or fast transfer is the priority and minor quality loss is acceptable — e.g. social media images, video streaming, music streaming

Exam tip: The right choice always depends on the specific scenario requirements.

Worked example — brand logo vs social media image [4 marks]:
• High-resolution logo → Lossless compression: preserves pixel-perfect quality essential for professional branding, printing, and high-definition media
• Social media campaign image → Lossy compression: significantly reduces file size for faster uploading and loading; the minor quality loss is acceptable given the fast-paced, temporary nature of social media

Run Length Encoding and Dictionary Coding

Run Length Encoding (RLE)

RLE condenses consecutive identical elements into a single value with a count.

• Text example: AAAABBBCCDAA → 4A3B2C1D2A (four A's, three B's, two C's, one D, two A's)
• Image example: a row with 5 red pixels then 3 blue pixels → 5R3B
• Used in bitmap images to compress sequences of the same colour
• Most effective when data contains long runs of the same value

Worked example — decompress 3C3M4C:
Answer: CCCMMMCCCC

Worked example — compress CCCCOLLLCCCCCMOCCCCC:
Answer: 4C1O3L5C1M1O5C

RLE pseudocode — find longest consecutive run of 'C' in a string:

Function longest(s: String) -> Integer:
  max_count = 0
  current_count = 0
  For each char in s:
    If char equals 'C':
      Increment current_count by 1
      If current_count > max_count:
        Set max_count to current_count
    Else:
      Set current_count to 0
  Return max_count

Dictionary Coding

Dictionary coding replaces recurring sequences or words with shorter unique codes, using a lookup dictionary.

1. Scan the data for recurring sequences
2. Build a dictionary mapping each sequence to a short code
3. Replace all occurrences of the sequences with their codes
4. Store the dictionary alongside the compressed data (needed for decompression)

Example:
Original: *"QuickSort is faster than BubbleSort but MergeSort is more stable than QuickSort and BubbleSort."* (95 characters)

Dictionary: QuickSort → Q, BubbleSort → B, MergeSort → M

Compressed: *"Q is faster than B but M is more stable than Q and B."* (53 characters)

• More versatile than RLE — works on any repeated sequence, not just consecutive identical characters
• May require more computational resources than RLE

Exam tip:
• Use RLE when data has lots of repetition of consecutive identical values (e.g. image rows, repeated characters)
• Use dictionary coding when data contains frequently repeated words or sequences that may not be consecutive

Symmetric and Asymmetric Encryption

Encryption converts readable data (plaintext) into an unreadable format (ciphertext) to protect it from unauthorised access. Encryption uses keys — specialised algorithms to scramble and unscramble data.

Symmetric Encryption

• A single shared key is used to both encrypt and decrypt the data
• The sender encrypts using the key; the receiver decrypts using the same key
• Faster — efficient for encrypting large amounts of data
• Key distribution problem — the key must be securely shared between sender and receiver; if intercepted, all messages are compromised
• Suitable for: encrypting large files, databases, backups where the same person/system encrypts and decrypts

Asymmetric Encryption

• Uses a pair of mathematically linked keys: a public key (for encryption) and a private key (for decryption)
• The receiver generates both keys; their public key is shared openly
• Anyone can encrypt data with the public key, but only the private key (kept secret by the receiver) can decrypt it
• Slower than symmetric encryption
• Solves the key distribution problem — sharing the public key does not compromise security
• Suitable for: passwords, bank details, secure email, government communications

Choosing an Encryption Type

Encryption Type	Suitable For	Reason
Symmetric	Large files, databases, bulk data	Fast and efficient; ideal when the same person/system encrypts and decrypts
Asymmetric	Confidential communications, passwords, banking	Highly secure key exchange; solves the key distribution problem

Worked example — 9-mark discussion on encryption and society:

Encryption protects people from data misuse in today's digital society, underpinning e-commerce, secure communications, and personal data protection.

Asymmetric encryption uses a public key for encryption and a private key for decryption. Because the public key can be shared openly without risk, it solves the critical key distribution problem that symmetric encryption faces. It forms the backbone of secure online transactions and enables digital signatures.

Symmetric encryption uses a single key for both operations. It is faster and more efficient for handling large volumes of data, but requires a secure method of sharing the key — if intercepted, all encrypted data is at risk. Asymmetric encryption is slower but eliminates this vulnerability.

Symmetric encryption is best for encrypting large files or databases within a closed, secure environment. Asymmetric encryption is preferred for high-security scenarios such as online banking or secure email where key distribution cannot be assumed to be safe.

Both forms of encryption have societal trade-offs: they enable legitimate privacy and secure commerce, but also allow nefarious actors to communicate securely. The choice between them should weigh speed requirements against security needs.

Hashing

Hashing converts any input data into a fixed-size string of characters called a digest using a hashing algorithm.

Key properties:
• The same input always produces the same hash (consistency)
• Even a tiny change in input produces a radically different hash (sensitivity)
• Hashing is one-way — it cannot be reversed to the original data

Common Hashing Algorithms

Algorithm	Notes
MD5	Widely used but now considered weak; vulnerable to collision attacks
SHA-1	Previously used in SSL certificates; now considered weak
SHA-256	Part of the SHA-2 family; commonly used for cryptographic applications and data integrity; considered secure for most purposes
SHA-3	The most recent SHA family member; designed for higher security levels

Encryption vs Hashing

Feature	Encryption	Hashing
Purpose	Secure data for transmission or storage	Data verification and quick retrieval
Reversibility	Can be decrypted to the original data	Cannot be reversed — one-way function
Keys	Uses keys for encryption and decryption	No keys involved
Speed	Generally slower for strong methods	Generally faster
Use cases	Secure communications, file storage	Password storage, data integrity checks
Output length	Variable — may be same as or longer than input	Always fixed length
Change in output	Small input change → significantly different output	Small input change → significantly different output

Hashing for Password Storage

1. When a user registers, their password is hashed using a hashing algorithm
2. The hash digest is stored in the database — never the plaintext password
3. When the user logs in, the entered password is hashed again
4. The new hash is compared against the stored hash — if they match, access is granted
5. If the database is breached, attackers cannot recover original passwords from the digests

Hashing for Efficient Data Retrieval

A hash table uses hash digests as index values for fast data lookup:
• Hash digests have a fixed length, making comparisons computationally faster than comparing variable-length strings (e.g. email addresses)
• A good hash function distributes data uniformly across the table, enabling efficient retrieval
• Example: a Users table indexed by hash digest allows the system to locate any user's record faster than searching by email address

Hashing for Data Integrity

When data is transferred over a network, it can be corrupted or interfered with. Hashing allows verification:
• Hash the data before sending; hash it again after receiving
• If both hashes match, the data arrived intact
• Comparing two fixed-size hashes is computationally less intensive than comparing entire files

Worked example — hash value for www.foo.co.uk (steps: strip to domain, uppercase, sum ASCII):
1. Remove up to first dot → foo.co.uk
2. Remove from next dot → foo
3. Uppercase → FOO
4. Sum ASCII values: F(70) + O(79) + O(79) = 228

Worked example — write the hash function in pseudocode [5 marks]:

function hash(siteName)
  firstDot = locate(siteName, ".")
  remainingString = siteName[firstDot+1:]

  secondDot = locate(remainingString, ".")
  requiredString = upper(remainingString[:secondDot])

  sum = 0
  for char in requiredString
    sum += asc(char)
  endfor

  return sum
endfunction