From the course: Symmetric Cryptography Essential Training

Simple substitution ciphers

- [Instructor] Let's turn to our first real cryptosystem, simple substitution ciphers. Historically, substitution ciphers started in a simpler form called shift ciphers then evolved into slightly more complex substitution ciphers. Please note, though, these are historical, and they're trivial to break. If you actually want to secure your data, you would never use these. So let's start with shift ciphers. They're called this because the entire alphabet is shifted by a certain number of steps. For example, if we were to shift by one letter, the letter A turns into B, B turns into C, and so on until Z wraps back around to the beginning and turns into A. Historically speaking, Julius Caesar used a variation of this that we now call the Caesar cipher. The Caesar cipher was just a rotation of three letters. There's another system called ROT13 that shifts the letter by 13, so A becomes N, B becomes O, and so on. I've written some code that demonstrates a shift cipher, and I have a function here in C++ called Caesar. It's not really a Caesar cipher, though, unless the key is set to three. So what happens in the main is it's going to prompt the user to enter a string and then enter a key, which is an integer that represents the number of steps that the alphabet will shift. It calls the Caesar function to produce a cipher text, which it outputs. It then asks you if you want to try to find the original message with brute force, and then it shows you the 25 different possible alphabet shifts, which would easily help you to recognize and recover the original message. Hopefully, this simple program demonstrates why any shift cipher is not a good way of securing messages. I've compiled this already, and I'm going to run the program now. Let's enter a string, hello. We'll keep it simple. We'll make this a proper Caesar cipher with a key of three. So H is increased by three up to K, E to H, and so on, which gives us this string. Let's have the program show us all of the different possible rotations and see if we can find the original message in the list. And we see down towards the bottom where it says 23. If we were to rotate an additional 23 characters, we would get the original message, hello, back. One of the problems with a shift cipher is that when you figure out one of the letters, you know all the rest. There's only 25 possible ways that you can change 'em. Let's try something different. Rather than shifting by however many steps, let's consider producing a completely different mapping. So in this partial example, we have A becoming T, B becoming J, and so on. It should be obvious that this is not simply a rotated alphabet. We call this a substitution cipher. With a substitution cipher, it's a little bit harder, but it's still easy enough to do encryption decryption by hand, and it's easy enough to do hand crypt analysis to extract the message even without the key. And, of course, it's nothing at all for a computer to solve it. There's a technique called frequency analysis, which is one of the tools that can help us break substitution ciphers. Letter frequencies in English follow a predictable pattern as you can see in the graph. E is the most common letter in English by far, followed by T and A and so on, all the way down to the very infrequent letters, Q and Z. So in order to demonstrate how to do frequency analysis, I've taken a text file that is the book, Little Brother, by Cory Doctorow, and I've taken it through a sequence of steps. First, I snipped off license bits at the end, and then I cut off the beginning where there was a bunch of other material that was not part of the book itself. Then I removed all spaces and punctuations so that only the letters were left. Then I converted all the uppercase to lowercase, and finally I applied ROT13 to the entire file, which after we open it, you can see is unreadable. If you want to see the exact steps that I used, I've included them in the exercise files if you want to experiment on your own. So what I'm doing here is I'm using grep to put each letter on a line of its own, then counting the number of each letter, and finally, sorting them by the most frequent. So what we can see here, if we look at this file, so what you can see here is the letter R appears most frequently in this ROT13 version, which we can almost certainly conclude corresponds to letter E. So let's take a look at what the actual frequency is from the original file. I do that by running the same command on the edit four file. And we can see that, indeed, the letter E is the most frequently occurring letter in the book. Like you might with the newspaper cryptogram, crypto-quote, or crypto-quip, once you know the frequencies of the letters, you can use that information to solve the puzzle. I expect many of you have done these sorts of word puzzles in the past, or if you don't want to do that by hand, you can use any number of online cryptogram solvers such as this one at rumkin.com. I've created a cryptogram for you, which you probably can't look at, and immediately decide what it says. So I'm just pasting this in here, and I'm going to solve the cryptogram, and you see the results, which should, again, reinforce that even though we have upgraded from shift ciphers, substitution ciphers are still really easy to crack and don't provide real security.

Contents