# Cybersecurity

Cybersecurity is a Division B and Division C event that was first run as a trial event at the 2021 BEARSO Invitational to replace Ping Pong Parachute. The event consists of two parts: a written test on Cryptography, Web Architecture, and Principles of Cybersecurity and a hands-on task on Cryptography and Programming. The event was also run at the November Scilympiad Practice, Yosemite Invitational, and Science Olympiad at Penn State Invitational.

The event will be run as a national Division C trial event at the 2022 National Tournament.

# Cryptography

## Hash algorithms

A hash algorithm is a one-way function that maps data, such as a string or a file, to a hash, or "digest": a fixed-length string of data that is usually much shorter than the input. Hash functions are deterministic: if the same input is hashed on two separate occasions, the digest will always be the same. A hash can be used as a checksum to validate that a file has not been altered, since changing even a single bit of the input changes the checksum. Because the digest is usually much shorter than the input, two different inputs can map to the same output; this is called a hash collision, and hash functions are designed to make collisions hard to find. Hash functions are used in digital signatures, signing and authentication algorithms, and password storage.

A good hash algorithm has the following characteristics:

• It is hard to find collisions.
• It is irreversible.
• It has to be deterministic.
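The properties above can be observed with Python's standard `hashlib` module. This is only a small sketch: the helper name `sha256_hex` and the sample inputs are made up for illustration, and SHA-256 is used simply because it ships with Python.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

first = sha256_hex(b"Science Olympiad")
second = sha256_hex(b"Science Olympiad")
# The last byte differs from the original by a single bit ('d' vs 'e').
altered = sha256_hex(b"Science Olympiae")
long_input = sha256_hex(b"x" * 10_000)

# Deterministic: equal inputs always produce equal digests.
print(first == second)
# Fixed length: the digest has the same size no matter how long the input is.
print(len(first), len(long_input))
# Checksum behaviour: a one-bit change in the input gives a different digest.
print(first != altered)
```

Irreversibility cannot be shown by running code; it is a claim that no practical way of recovering the input from the digest is known.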

Password storage is one of the most important applications of hashing algorithms. When a password is entered, its hash is calculated and compared to the stored hash of the original password. Thus, no plaintext passwords need to be saved server-side, which reduces the damage in the event of a data breach.
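A minimal sketch of this check is shown below. The function names and the sample password are hypothetical, and plain SHA-256 is used only to keep the example short; real systems typically add a per-user salt and use a deliberately slow function such as `hashlib.pbkdf2_hmac`.

```python
import hashlib
import hmac

def hash_password(password: str) -> str:
    """Hash a password for storage (simplified: unsalted SHA-256)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()

def check_password(attempt: str, stored_hash: str) -> bool:
    """Hash the attempt and compare it to the stored hash."""
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(hash_password(attempt), stored_hash)

# Only the hash is stored server-side, never the plaintext password.
stored = hash_password("correct horse battery staple")

print(check_password("correct horse battery staple", stored))
print(check_password("wrong guess", stored))
```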

### MD5

The MD5 hashing algorithm produces 16-byte digests (128 bits) and processes data in blocks of 64 bytes (512 bits). It was created in 1991 by Ronald Rivest. Multiple vulnerabilities have since been found in MD5, and collisions can be calculated in less than a second on a typical computer. Thus, MD5 is considered cryptographically broken; however, many programs and applications continue to use it.

### SHA1

SHA1, which stands for Secure Hash Algorithm 1, was created in 1995. The algorithm produces 20-byte digests (160 bits) using a block size of 512 bits, and performs 80 rounds of hashing on each block. Hash collisions can be calculated with 2^60.3 to 2^65.3 operations, and collisions have been found in practice. In addition, SHA1 is built using the Merkle-Damgård construction, so it is also prone to length extension attacks.

### SHA2

SHA2 is a set of six hash functions, namely SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and SHA-512/256. As the names suggest, each function produces a digest with a different number of bits (for example, SHA-256 produces a 256-bit digest). Either 64 or 80 rounds of hashing are performed on each block. Like SHA1, SHA2 is built using the Merkle-Damgård construction, so SHA-256 and SHA-512 are vulnerable to hash length extension attacks; the truncated variants expose only part of the internal state, which makes the attack much harder. Currently, there are no known full collisions of SHA2 hashes.
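The digest and block sizes quoted for MD5, SHA1, and SHA2 can be checked against Python's `hashlib`, which reports both values in bytes. The dictionary name `sizes` is just an illustrative choice, and only algorithms that `hashlib` always provides are listed.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha224", "sha256", "sha384", "sha512")

# Map each algorithm name to (digest size, block size), both in bytes.
sizes = {}
for name in ALGORITHMS:
    hasher = hashlib.new(name)
    sizes[name] = (hasher.digest_size, hasher.block_size)

for name, (digest_bytes, block_bytes) in sizes.items():
    print(f"{name:>6}: digest {digest_bytes * 8} bits, block {block_bytes * 8} bits")
```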

### Hash Length Extension Attack

Algorithms that are based on the Merkle-Damgård construction are vulnerable to hash length extension attacks, including MD5, SHA1, and the untruncated SHA2 hashes (SHA-256 and SHA-512). Because the hash digest is essentially a snapshot of the internal state of the hash, the internal state can be recreated from the digest. In addition, because the hashing algorithm is performed on blocks of data, one can compute HASH(UNKNOWN + PADDING + CUSTOM MESSAGE), where PADDING is the padding the algorithm itself appended to UNKNOWN, knowing only the length of UNKNOWN and the digest HASH(UNKNOWN), by continuing the hashing algorithm from the known internal state. This attack is significant because it allows attackers to forge requests that are authenticated with HASH(SECRET + MESSAGE). One mitigation is to hash twice: for instance, HASH(HASH(UNKNOWN)) could be used, since an extended digest is then no longer a valid double hash. HMAC is the standard construction built on this nested-hash idea.
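The attack can be sketched against SHA-256 in pure Python. Everything named here (`compress`, `md_padding`, `extend`, the secret, and the messages) is illustrative rather than taken from any library. `hashlib` does not expose its internal state, so the compression function is re-implemented, and the round constants are derived from cube roots of primes instead of being typed out. Note that the forged input contains the original padding ("glue") between the unknown data and the appended message.

```python
import hashlib
import struct

MASK = 0xFFFFFFFF

def _first_primes(count):
    primes, candidate = [], 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes

def _integer_cube_root(n):
    lo, hi = 0, 1 << 44
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# Round constants: first 32 fractional bits of the cube roots of the first 64 primes.
K = [_integer_cube_root(p << 96) & MASK for p in _first_primes(64)]

def _rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & MASK

def compress(state, block):
    """Apply the SHA-256 compression function to one 64-byte block."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = _rotr(w[i - 15], 7) ^ _rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = _rotr(w[i - 2], 17) ^ _rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & MASK)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        big_s1 = _rotr(e, 6) ^ _rotr(e, 11) ^ _rotr(e, 25)
        choose = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + choose + K[i] + w[i]) & MASK
        big_s0 = _rotr(a, 2) ^ _rotr(a, 13) ^ _rotr(a, 22)
        majority = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + majority) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return [(x + y) & MASK for x, y in zip(state, (a, b, c, d, e, f, g, h))]

def md_padding(message_len):
    """Padding SHA-256 appends to a message of message_len bytes."""
    zeros = b"\x00" * ((55 - message_len) % 64)
    return b"\x80" + zeros + struct.pack(">Q", message_len * 8)

def extend(known_digest, known_len, extension):
    """Return (glue, digest of unknown + glue + extension) without knowing unknown."""
    state = list(struct.unpack(">8I", known_digest))  # digest == internal state
    glue = md_padding(known_len)
    total_len = known_len + len(glue) + len(extension)
    data = extension + md_padding(total_len)
    for i in range(0, len(data), 64):
        state = compress(state, data[i:i + 64])
    return glue, struct.pack(">8I", *state)

# The "server" knows the secret; the attacker sees only the digest and the length.
secret, message = b"server-secret", b"user=guest"
known_digest = hashlib.sha256(secret + message).digest()
glue, forged_digest = extend(known_digest, len(secret + message), b"&admin=true")

# Check against what the server would compute for the forged input.
expected = hashlib.sha256(secret + message + glue + b"&admin=true").digest()
print(forged_digest == expected)
```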

### Hash Collisions

A hash collision occurs when two different plaintexts, hashed with the same algorithm, produce the same digest. All hashing algorithms have an infinite number of collisions because the plaintext can be arbitrarily long, yet the length of the digest is finite. However, when the digest is long enough, it is extremely hard to actually find a collision.
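The effect of digest length can be illustrated by deliberately shortening a hash. The sketch below keeps only the first two bytes of a SHA-256 digest, so there are just 65,536 possible outputs and a collision turns up almost immediately. The names `short_hash` and `find_collision` are hypothetical helpers, not part of any library.

```python
import hashlib
from itertools import count

def short_hash(data: bytes, n_bytes: int = 2) -> bytes:
    """A deliberately weak hash: the first n_bytes of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[:n_bytes]

def find_collision(n_bytes: int = 2):
    """Hash the messages b"0", b"1", b"2", ... until two share a short hash."""
    seen = {}
    for i in count():
        message = str(i).encode()
        tag = short_hash(message, n_bytes)
        if tag in seen:
            return seen[tag], message
        seen[tag] = message

first, second = find_collision()
print(first, second, short_hash(first).hex())
```

With the full 32-byte digest the same search would not finish in any practical amount of time, which is why long digests make collisions hard to produce even though they must exist.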

## The XOR Operation

The XOR operator is a binary operator that takes two bits of data and outputs one bit of data. XOR stands for "exclusive or"; it returns True if the two inputs are different, and False if the inputs are the same. In Java and Python, the XOR operator is denoted with the "^" symbol (in Python, exponentiation is represented with "**"). In contrast, SageMath uses the "^" symbol to represent exponentiation, while "^^" is the XOR operator.

This is an XOR table which represents the inputs as well as the outputs:

| Input 1 | Input 2 | Output |
|---------|---------|--------|
| True    | True    | False  |
| True    | False   | True   |
| False   | True    | True   |
| False   | False   | False  |

In addition, the XOR operator is commutative (A^B == B^A), associative ((A^B)^C == A^(B^C)), and its own inverse (A^A == 0, so B^A^A == B). The last property is especially important: it means that the same function can be used both to encrypt a plaintext and to decrypt a ciphertext.
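These properties can be checked directly in Python, where `^` is the XOR operator on integers. The helper `xor_bytes` and the sample key are illustrative assumptions; the point is that calling the same function twice with the same key returns the original data.

```python
def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte of data with the key, repeating the key as needed."""
    return bytes(byte ^ key[i % len(key)] for i, byte in enumerate(data))

a, b, c = 0b1100, 0b1010, 0b0110

print(a ^ b == b ^ a)                # commutative
print((a ^ b) ^ c == a ^ (b ^ c))    # associative
print(a ^ a == 0, b ^ a ^ a == b)    # self-inverse

plaintext = b"SciOly"
key = b"\x13\x37"
ciphertext = xor_bytes(plaintext, key)
recovered = xor_bytes(ciphertext, key)  # the same function decrypts
print(recovered == plaintext)
```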

In cryptography, the XOR operation is used extremely frequently in stream ciphers, block ciphers, and many other constructions. Not only is the XOR operation computationally inexpensive, but a cipher that XORs the plaintext with a truly random key that is as long as the message, kept secret, and never reused (a one-time pad) is also theoretically impossible to break.

Ciphers that rely purely on the XOR operation, such as stream ciphers and the one-time pad (OTP), are susceptible to a known-plaintext attack if the keystream is reused. If a plaintext and its corresponding ciphertext are known, then the keystream can be trivially calculated using the operation K = C^P. Thus, it is important for a keystream to be unique and to be used only once.
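The known-plaintext attack can be sketched in a few lines. The messages and the `xor` helper are made up for the example, and the keystream is random bytes from the `secrets` module; the attacker never sees it directly.

```python
import secrets

def xor(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings position by position (truncates to the shorter)."""
    return bytes(x ^ y for x, y in zip(a, b))

# The defender (incorrectly) encrypts two messages with the same keystream.
keystream = secrets.token_bytes(32)
known_plaintext = b"ATTACK AT DAWN"
secret_plaintext = b"RETREAT AT TEN"
ciphertext_1 = xor(known_plaintext, keystream)
ciphertext_2 = xor(secret_plaintext, keystream)

# The attacker knows ciphertext_1 and its plaintext, so K = C ^ P.
recovered_keystream = xor(ciphertext_1, known_plaintext)

# The recovered keystream decrypts any other message that reused it.
decrypted = xor(ciphertext_2, recovered_keystream)
print(decrypted)
```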

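Crib dragging is a method of attacking XOR ciphers in which a key is repeated or reused. The attacker guesses a word or phrase that is likely to appear in the plaintext (the "crib") and "drags" it along the ciphertext, XORing it in at each position; positions where the result is readable reveal part of the plaintext message. Frequency analysis and brute force can be used to choose cribs and to extend partial results.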
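A minimal crib-dragging sketch follows, assuming two messages were encrypted with the same keystream. XORing the two ciphertexts cancels the keystream and leaves the XOR of the two plaintexts; wherever the crib matches one plaintext, XORing it in reveals the other. The messages, keystream, and function names are invented for the example, and "readable" is crudely defined as lowercase letters and spaces, so false positives are possible.

```python
import string

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

READABLE = set((string.ascii_lowercase + " ").encode())

def drag_crib(combined: bytes, crib: bytes):
    """Try the crib at every offset; keep offsets whose output looks like text."""
    hits = []
    for offset in range(len(combined) - len(crib) + 1):
        guess = xor(combined[offset:offset + len(crib)], crib)
        if all(byte in READABLE for byte in guess):
            hits.append((offset, guess))
    return hits

plaintext_1 = b"the flag is hidden in the server room"
plaintext_2 = b"meet me at the north gate at midnight"
keystream = bytes(range(100, 137))  # reused for both messages (the mistake)

ciphertext_1 = xor(plaintext_1, keystream)
ciphertext_2 = xor(plaintext_2, keystream)

# The keystream cancels out: combined == plaintext_1 XOR plaintext_2.
combined = xor(ciphertext_1, ciphertext_2)

hits = drag_crib(combined, b" the ")
for offset, guess in hits:
    print(offset, guess)
```

Each hit is a candidate fragment of the other message at that offset, which the attacker can then extend by guessing the surrounding words.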

## Bases

Different number bases, most often powers of 2, are used extensively in computer science because of their compatibility with computers.

To express a string or a file as a long integer, each character in the string is looked up in an ASCII table and strung together as a string of bytes. For instance, the word "SciOly.org" would be expressed as "53 63 69 4f 6c 79 2e 6f 72 67" in hex (base 16), "01010011 01100011 01101001 01001111 01101100 01111001 00101110 01101111 01110010 01100111" in binary (base 2), and "U2NpT2x5Lm9yZw==" in base 64.
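The three encodings of "SciOly.org" given above can be reproduced with Python's built-in types and the standard `base64` module. The variable names are arbitrary.

```python
import base64

text = "SciOly.org"
raw = text.encode("ascii")  # one byte per ASCII character

hex_form = " ".join(f"{byte:02x}" for byte in raw)
binary_form = " ".join(f"{byte:08b}" for byte in raw)
base64_form = base64.b64encode(raw).decode("ascii")

# The same bytes read as one long big-endian integer.
as_integer = int.from_bytes(raw, "big")

print(hex_form)
print(binary_form)
print(base64_form)
print(as_integer)
```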

### Binary

A binary digit is simply a 0 or 1. Representing an integer, string, or file as a string of many binary digits is useful because it is one of the easiest ways to store information (only two different "modes" are necessary, such as high voltage and low voltage) and the logic gates for binary values are simple as well.

The hexadecimal number system is often used in place of binary, not only because it is easier for humans to read but also because it converts easily to and from binary: each hexadecimal digit corresponds to exactly four binary digits. In this number system, the values 0-15 are written with the digits 0123456789ABCDEF.

### Base 64

Base 64 is a compact way of expressing data as a string of printable characters, since each character stores 6 bits of data. The 64 characters used are as follows:

• 0-25: A-Z
• 26-51: a-z
• 52-61: 0-9
• 62: +
• 63: /

In base 64, every 4 characters of encoded data decode to 3 bytes of unencoded data. Thus, when the length of the unencoded data is not a multiple of 3, padding must be added in order to make the length of the encoded data a multiple of 4. The "=" character is used as padding.
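The padding rule can be seen by encoding inputs of one, two, three, and four bytes with the standard `base64` module. The sample inputs and the `encoded` dictionary are illustrative.

```python
import base64

samples = (b"S", b"Sc", b"Sci", b"SciO")

# Map each raw input to its base 64 text.
encoded = {raw: base64.b64encode(raw).decode("ascii") for raw in samples}

for raw, text in encoded.items():
    # Encoded length is always a multiple of 4; "=" fills the gap.
    print(len(raw), "bytes ->", len(text), "characters:", text)

# Decoding strips the padding and returns the original bytes.
print(base64.b64decode("U2M="))
```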

## Classical Cryptography

Classical cryptography includes cryptosystems that are no longer in use, as extensive attacks have been developed and can be trivially implemented to reveal the message. Most classical cryptosystems encrypt plaintext, such as messages in English, rather than files. For example, all of the cipher types in Codebusters (excluding the RSA cipher) are classical cryptosystems.

### Substitution Ciphers

In a substitution cipher, each plaintext character is replaced by a ciphertext character. In many methods of substitution, such as those involving the Morse or Baconian alphabets, the plaintext or ciphertext units are actually groups of multiple characters. Either way, the plaintext is enciphered by replacing each plaintext unit with a ciphertext unit, and the ciphertext is deciphered by replacing each ciphertext unit with a plaintext unit.

The method of determining how to map plaintext and ciphertext units varies based on the cryptosystem. Some cryptosystems involve algorithms or mathematical formulas that can be used to determine the corresponding ciphertext given a plaintext unit (and vice versa), such as the Hill cipher. Other cryptosystems are simply random mappings from one alphabet to another, such as monoalphabetic substitution ciphers (e.g. Aristocrats).
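A monoalphabetic substitution can be sketched with Python's `str.maketrans`. The cipher alphabet below (keyboard order) is an arbitrary choice for the example; in an Aristocrat it would be a random permutation of the alphabet.

```python
import string

PLAIN_ALPHABET = string.ascii_uppercase
CIPHER_ALPHABET = "QWERTYUIOPASDFGHJKLZXCVBNM"  # arbitrary example mapping

ENCIPHER_TABLE = str.maketrans(PLAIN_ALPHABET, CIPHER_ALPHABET)
DECIPHER_TABLE = str.maketrans(CIPHER_ALPHABET, PLAIN_ALPHABET)

def encipher(plaintext: str) -> str:
    """Replace each letter with its partner in the cipher alphabet."""
    return plaintext.upper().translate(ENCIPHER_TABLE)

def decipher(ciphertext: str) -> str:
    """Apply the inverse mapping; non-letters pass through unchanged."""
    return ciphertext.translate(DECIPHER_TABLE)

secret = encipher("Hello, world")
print(secret)
print(decipher(secret))
```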

### Transposition Ciphers

Whereas substitution ciphers replace plaintext/ciphertext units with ciphertext/plaintext units, transposition ciphers retain the same units but rearrange them using some pattern. An example of a transposition cipher would be the Railfence cipher.
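As a sketch of transposition, the Railfence cipher writes the message in a zigzag across a number of rails and reads the rails off one at a time. The implementation below is one straightforward way to do this; the function names and the three-rail example are assumptions for illustration.

```python
def rail_pattern(length: int, rails: int):
    """Rail index (0 = top) for each position of a zigzag of the given length."""
    pattern, rail, step = [], 0, 1
    for _ in range(length):
        pattern.append(rail)
        if rails > 1:
            if rail == 0:
                step = 1
            elif rail == rails - 1:
                step = -1
            rail += step
    return pattern

def encipher(plaintext: str, rails: int) -> str:
    pattern = rail_pattern(len(plaintext), rails)
    return "".join(
        char
        for rail in range(rails)
        for char, position_rail in zip(plaintext, pattern)
        if position_rail == rail
    )

def decipher(ciphertext: str, rails: int) -> str:
    pattern = rail_pattern(len(ciphertext), rails)
    # Positions in the order the encipherer read them (stable sort by rail).
    order = sorted(range(len(ciphertext)), key=lambda i: pattern[i])
    plaintext = [""] * len(ciphertext)
    for char, position in zip(ciphertext, order):
        plaintext[position] = char
    return "".join(plaintext)

message = "WEAREDISCOVEREDFLEEATONCE"
scrambled = encipher(message, 3)
print(scrambled)
print(decipher(scrambled, 3))
```

The ciphertext contains exactly the same letters as the plaintext, only in a different order, which is what distinguishes transposition from substitution.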

### Attacks on Classical Cryptosystems

• Chosen Plaintext Attacks
• Chosen Ciphertext Attacks
• Known Plaintext Attacks

## RSA

The RSA cryptosystem, like many other modern cryptosystems, relies on a problem that is computationally difficult, though not impossible, to solve: in the case of RSA, factoring the product of two large prime numbers.
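The dependence on factoring can be shown with a toy example. The primes below are far too small to be secure and were chosen only so the numbers are easy to follow; `trial_division` is a hypothetical helper standing in for the attacker's factoring step. The three-argument `pow` with exponent `-1` (modular inverse) requires Python 3.8 or later.

```python
def trial_division(n: int):
    """Factor n = p * q by brute force (only feasible for tiny n)."""
    factor = 2
    while factor * factor <= n:
        if n % factor == 0:
            return factor, n // factor
        factor += 1
    raise ValueError("n has no nontrivial factors")

# Key generation with toy primes.
p, q = 61, 53
n = p * q                      # public modulus
e = 17                         # public exponent
phi = (p - 1) * (q - 1)
d = pow(e, -1, phi)            # private exponent

message = 65
ciphertext = pow(message, e, n)      # encrypt with the public key
decrypted = pow(ciphertext, d, n)    # decrypt with the private key

# An attacker who can factor n rebuilds the private key from public values.
found_p, found_q = trial_division(n)
recovered_d = pow(e, -1, (found_p - 1) * (found_q - 1))
print(recovered_d == d, pow(ciphertext, recovered_d, n))
```

With primes hundreds of digits long, the `trial_division` step (and every other known classical factoring method) becomes impractical, which is what keeps the private exponent secret.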

## Post Quantum Cryptography

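Quantum computers can break some cryptographic functions (e.g. those that rely on the difficulty of integer factorization, through Shor's algorithm). Post-quantum cryptography refers to cryptosystems that run on classical computers but are based on mathematical problems believed to be hard even for quantum computers. It should not be confused with quantum cryptography, which uses properties of quantum mechanics, such as entanglement, to secure communication.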

### Shor's Algorithm

Shor's algorithm is built from steps that would be computationally slow if all of them were run on a classical computer. However, when the algorithm is run on a quantum computer, it can factor large composite numbers quickly. It is only able to do so through a reduction, which is the redefinition of a problem in terms of an "easier" problem (that is, one that has a lower computational complexity). In particular, Shor's algorithm reduces the factoring problem to a problem relying primarily on greatest common divisor (GCD) computation (which can usually be performed quickly using the Euclidean algorithm), modular exponentiation (which has many properties that allow it to be done relatively quickly using quantum circuit implementations), and period-finding (the focus of Shor's algorithm, as it is where the biggest speedup occurs). The algorithm itself is briefly described below.

First, a composite number $\displaystyle{ N }$ to be factored is chosen, along with an arbitrary $\displaystyle{ a \in \left(1, N\right) }$. If $\displaystyle{ \text{gcd}\left(a, N\right) = p \neq 1 }$ (that is, $\displaystyle{ a }$ is not coprime with $\displaystyle{ N }$ and shares a factor $\displaystyle{ p }$ which isn't $\displaystyle{ 1 }$), then a factor of $\displaystyle{ N }$ has been found. Otherwise, the next step is to use period-finding to find the period $\displaystyle{ r }$ of $\displaystyle{ a^x\ \text{mod}\ N }$, that is, the smallest positive integer $\displaystyle{ r }$ such that $\displaystyle{ a^r \equiv 1\ \text{mod}\ N }$. The period $\displaystyle{ r }$ must be even; if it is not, a different $\displaystyle{ a }$ must be chosen and the process must be restarted. Since, by definition, $\displaystyle{ a^r \equiv 1\ \text{mod}\ N }$, $\displaystyle{ a^r - 1 }$ must be a multiple of $\displaystyle{ N }$. Although this all may not seem very useful, the magic of Shor's algorithm lies in the following finding: we may factor $\displaystyle{ a^r - 1 }$ into $\displaystyle{ \left(a^{r/2} - 1\right) \cdot \left(a^{r/2} + 1\right) }$. The first term cannot be a multiple of $\displaystyle{ N }$: if it were, then $\displaystyle{ a^{r/2} \equiv 1\ \text{mod}\ N }$, and $\displaystyle{ r/2 }$ would be a smaller period than $\displaystyle{ r }$. To continue, it is assumed that the second term is also not a multiple of $\displaystyle{ N }$ (if it were, a different $\displaystyle{ a }$ would have to be chosen and the process would have to be restarted). Since neither term of this factoring is a multiple of $\displaystyle{ N }$, but their product $\displaystyle{ a^r - 1 }$ is a multiple of $\displaystyle{ N }$, each term must share with $\displaystyle{ N }$ a factor other than $\displaystyle{ 1 }$ and $\displaystyle{ N }$, which can be extracted by computing $\displaystyle{ \text{gcd}\left(a^{r/2} \pm 1, N\right) }$. If either $\displaystyle{ a^{r/2} - 1 }$ or $\displaystyle{ a^{r/2} + 1 }$ is itself a prime number, then it must be one of the prime factors of $\displaystyle{ N }$, which can easily be verified.

Although one execution of this algorithm does not necessarily yield all (or even any) of the prime factors of $\displaystyle{ N }$, knowing that $\displaystyle{ a^{r/2} \pm 1 }$ share factors with $\displaystyle{ N }$ provides a "better guess" for the prime factors of $\displaystyle{ N }$. Thus, the entire algorithm can be repeated, instead starting from $\displaystyle{ \text{gcd}\left(a^{r/2} \pm 1, N\right) }$. Although computing the GCD may seem like a computationally complex step, it is relatively fast. Instead, the slowdown occurs in finding the period $\displaystyle{ r }$. To do so classically, $\displaystyle{ a }$ may have to be raised to every power between $\displaystyle{ 1 }$ and $\displaystyle{ N }$. Shor's algorithm is centered around the quantum circuit implementation of period-finding, which uses the quantum Fourier transform (the discrete Fourier transform applied to the amplitudes of a quantum superposition) to find the frequency of $\displaystyle{ a^x\ \text{mod}\ N }$. Since the period is related to the frequency, this allows the computer to calculate the period of the expression very quickly: using quantum superposition states, it evaluates $\displaystyle{ a^x\ \text{mod}\ N }$ for all possible powers at once and finds the frequency at which the values repeat, instead of calculating each power separately and checking which one is equal to $\displaystyle{ r }$.
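The classical parts of the procedure can be sketched in Python, with the quantum period-finding step replaced by a brute-force loop. That loop is exactly the part that is slow classically, so this is a demonstration of the reduction only, not of the speedup. The function names are hypothetical.

```python
from math import gcd

def find_period(a: int, n: int) -> int:
    """Smallest r > 0 with a**r = 1 (mod n); assumes gcd(a, n) == 1.

    This brute-force search stands in for quantum period-finding.
    """
    r, value = 1, a % n
    while value != 1:
        value = (value * a) % n
        r += 1
    return r

def shor_classical(n: int, a: int):
    """Return a pair of factors of n, or None if this choice of a fails."""
    shared = gcd(a, n)
    if shared != 1:
        return shared, n // shared   # lucky guess: a shares a factor with n
    r = find_period(a, n)
    if r % 2 == 1:
        return None                  # odd period: choose a different a
    half = pow(a, r // 2, n)         # a^(r/2) mod n
    if half == n - 1:
        return None                  # a^(r/2) + 1 is a multiple of n
    return gcd(half - 1, n), gcd(half + 1, n)

print(find_period(7, 15))     # 7, 4, 13, 1 -> period 4
print(shor_classical(15, 7))
print(shor_classical(15, 14)) # fails: 14^(2/2) + 1 = 15 is a multiple of 15
print(shor_classical(21, 4))  # fails: the period of 4 mod 21 is 3, which is odd
```

For $\displaystyle{ N = 15 }$ and $\displaystyle{ a = 7 }$, the period is $\displaystyle{ 4 }$, so $\displaystyle{ a^{r/2} = 49 \equiv 4 }$, and the factors are $\displaystyle{ \text{gcd}\left(3, 15\right) = 3 }$ and $\displaystyle{ \text{gcd}\left(5, 15\right) = 5 }$.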

# Web architecture

## HTML/CSS/JS

The World Wide Web Consortium (W3C) is an international organization that standardizes web architecture through its specifications for technologies such as CSS and many browser APIs. The Web Hypertext Application Technology Working Group (WHATWG), of which Google, Mozilla, Apple, and Microsoft are leading members, also standardizes many aspects of web architecture; its frequently updated HTML and DOM specifications are the widely accepted standard for HTML development. The JavaScript language itself is standardized as ECMAScript by Ecma International. Mozilla's MDN Web Docs tend to be the most up-to-date reference for information on web technologies, whereas W3Schools and Stack Overflow often don't follow industry best practices. However, it is worth noting that standards are not followed by all browsers, even major ones such as Safari. As a result, cross-browser compatibility is a major issue that must be considered when developing web applications, given that browser developers are free to choose which standards to implement and how to implement them.

### APIs

Application Programming Interfaces (APIs) are interfaces between two or more endpoints, such as programs or systems, that allow the endpoints to communicate and share functionality. In the context of web browsers, many APIs exist, such as the File API (which allows websites to interact with files on a device or in the browser), audio APIs (which allow web developers to interact with a device's audio input and output hardware), graphics APIs (which allow websites to interact with a device's graphics hardware in order to display complex graphics), and the Notifications API (which allows websites to interact with the device's notification manager to send custom notifications triggered by the website or its server). APIs are not inherently supported on a website, browser, or even device. Rather, the website must request access to the API or include the API library in its code, the browser must implement the functionality, and the device must have the hardware or features needed (if the API in question requires device access).

One of the major issues concerning API security revolves around improper communications and insecure requests made through APIs, especially those that give the website external access, such as the ability to communicate with other websites or the user's device. In fact, unintended bugs in any piece of code could have adverse effects on other associated processes. In order to protect against these vulnerabilities, modern web browsers, such as those built on top of Chromium, use a method known as "sandboxing", where any untrusted processes, including websites and API implementations, run in isolated environments known as sandboxes, so that a damaged or compromised process cannot harm other processes. To interact with anything outside of its own sandbox, a process must use the functionality implemented and granted by the browser's inter-process communication (IPC) system. Although this is sound in theory, it is not sufficient on its own to make a browser secure, especially as many browsers knowingly put multiple processes into the same sandbox in order to meet system performance requirements.

### Protocols

The Hypertext Transfer Protocol (HTTP) is a protocol used by websites and other programs to communicate over the internet. Modern websites use HTTPS, which is simply an extension of HTTP that incorporates encryption through Transport Layer Security (TLS) in order to make HTTP requests more secure.

#### Requests

Requests are messages sent (usually) by user-end applications to some external endpoint, such as a server. There are multiple types of requests, known as methods or verbs.

• GET - Usually used to retrieve information from the endpoint, such as data in a database
• POST - Usually used to send information to the endpoint (more specifically, POST is usually used to create new entries of data)
• PUT - Usually used to send updated information to the endpoint (more specifically, PUT is usually used to replace an existing entry of data)
• PATCH - Usually used to send modifications to the endpoint (more specifically, PATCH is usually used to modify portions of an existing entry of data)
• DELETE - Usually used to request that a specific piece of data be deleted from the endpoint
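The methods above can be illustrated with Python's standard `urllib.request` module. The sketch below only builds the request objects and does not send anything; the URL `https://api.example.com/teams`, the JSON payloads, and the helper `build_request` are all hypothetical.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com/teams"  # hypothetical endpoint

def build_request(method: str, path: str = "", payload=None) -> urllib.request.Request:
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data, headers=headers, method=method)

example_requests = [
    build_request("GET"),                                           # retrieve entries
    build_request("POST", payload={"name": "Team A", "score": 0}),  # create an entry
    build_request("PUT", "/1", {"name": "Team A", "score": 42}),    # replace entry 1
    build_request("PATCH", "/1", {"score": 43}),                    # modify part of entry 1
    build_request("DELETE", "/1"),                                  # delete entry 1
]

for request in example_requests:
    print(request.get_method(), request.full_url, request.data)
```

Passing one of these objects to `urllib.request.urlopen` would actually send it; with an `https://` URL the exchange would be encrypted with TLS.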

## Resources

• Cybersecurity Rules - Southern California Trial Event (9.30.2020)
• USACO
• PicoCTF
• CyberPatriot