Sketch of a new English orthography
Everyone knows that English spelling is a mess, but it’s one of those things that most people just accept as locked in, like the similarly irrational QWERTY keyboard. Because the English language has largely grown by accumulating bits and pieces of other languages, its orthography is a bricolage of different orthographies, and its speakers need at least an intuitive understanding of many words’ origins in order to pronounce or spell them correctly. It’s so bad that Thorstein Veblen wrote about its use for identifying people with and without the privilege of leisure time in which to learn its arcane rules.
The position of English speakers is analogous to that of the Korean people in the 15th century. Korean, a language unrelated to Chinese, was written with Chinese characters largely repurposed for their phonetic qualities. This made for an awkward fit and was a substantial barrier to wider literacy. King Sejong commissioned the creation of the Hangul alphabet, which was custom-made for the Korean language and is probably one of the cleverest writing systems ever developed. Each character represents a syllable and can contain a vowel alone, a consonant plus a vowel, or a consonant plus a vowel plus another consonant. Given a list of Romanized Korean words and their equivalents in Hangul, a motivated 외국인 (phonetically “way-kook-in”, meaning foreigner) can learn to pronounce Korean words in under an hour. This is how I learned it when I was an English teacher in Incheon, with a list of my students’ names as my Rosetta Stone.
There have been similar efforts in English. The simplified spelling movement of the early 20th century got a few u’s expelled from American dictionary entries for words like color, but it ultimately settled for the standard Latin alphabet. Maybe it was that alphabet’s fundamental mismatch with spoken English that meant simplified spelling never had the kind of influence its very influential supporters imagined.
Subsequent reformers tried to develop custom scripts for English. One was Unifon, which used a modified, all-caps version of the Latin alphabet as its basis and added specialized glyphs. Whether intentional or not, I see the ambiguity around the pronunciation of this alphabet’s name as a sly commentary on English spelling — is it “YOO-ni-FAHN” or “YOO-ni-FOHN”? One look at the alphabet itself, though, makes it obvious that it was meant to be printed rather than written by hand. Shavian took up the mantle as a script designed for the English language that was meant to be both phonetically unambiguous and writable by hand.
As different as these two writing systems look, they shared the premise that a writing system should have a one-to-one relationship with the sounds of spoken language — “one letter, one sound” was the mantra.
I think this is the wrong approach. It’s more or less trivial to faithfully record the phonemes of a given utterance, but there’s a reason we don’t all write in the International Phonetic Alphabet. Not only would it be inconveniently long, but the lack of standardization across countries and regions would make reading a chore. On the other hand, there’s a reason we don’t all use shorthand, most forms of which omit all but leading vowels and reduce words so much that they become cryptic to everyone except the more experienced users.
A desirable solution to the quandary of English orthography, then, is one that can be used around the world and that balances phonetic fidelity with concision. We might also consider whether there might be other phonetic information that it would be valuable to embed in a new system of writing. Perhaps the most important information in spoken English that is totally neglected by our writing system is the system of stressed and unstressed vowels. We learn where to place these emphases totally through exposure, which presents challenges for English learners. So an ideal writing system for English would also embed information about the stressed and unstressed syllables.
The desire for including emphasis in the writing system goes hand in hand with the fundamental observation that most confusion around spelling comes from vowels. Sure, there are a few cases where consonants have inconsistent pronunciations, like the first versus second “c” in the word “circus”, but they’re nothing compared to the nineteen vowel sounds inconsistently recorded by our mere five vowels and the combinations thereof. (Honestly, if we could just harmonize the spelling of the final syllables of words like “consonant” and “dependent”, I’d be thrilled — words with the final “ent”/“ant” syllable are the orthographic equivalent for me of trying to plug in a USB-A cable with the correct orientation.)
Finally, there are sound combinations that occur often enough in English that it wouldn’t be unreasonable to just give them their own character. Imagine, for example, if we consistently replaced the “and” sound with the ampersand: we could eat a s&wich & sip some br&y on vacation in the &es Mountains! With a marginal expansion of the number of characters in the writing system, we could potentially make strides in both concision and clarity. This would bring elements of a syllabary into our writing system and transform it from an alphabet, but such an appeal to purity is hardly worth considering when we have so much to gain!
Summing up, my wishlist for written English would be as follows:
Has a reasonable balance between word length and unambiguous pronunciation
Embeds information on emphasis
Sorts out the absolute mess of a situation our vowels are in right now
Includes characters for high-frequency sound combinations
And, because we don’t want it to be too hard to learn, has no more than, say, 40 characters (not counting punctuation)
Greater minds than your author’s have been set to this task and haven’t achieved much, so I wondered what kind of system a neural network might create for English. I undertook this in a spirit of learning more about language and neural networks, so I don’t expect this to be for anything but fun. Although if this work calls out to your inner spelling reform evangelist, feel free to take up the banner.
Steelmanning the Latin alphabet
I first wanted to get a best-case estimate of how a neural network might perform at the task of predicting phonemes from written words. For this task, I used the CMU Pronouncing Dictionary and the Natural Language Toolkit (nltk) for Python. The Pronouncing Dictionary contains an exhaustive list of English words and their pronunciations as a set of key:value pairs. Pronunciations are written in the ARPAbet, a phonetically unambiguous system developed for building text-to-speech systems. This system is not entirely human-friendly to read — for example, the vowel sound in “boat” is represented as OW — but it does include information on primary stresses, secondary stresses, and unstressed vowels, which is perfect for meeting desideratum number 2 above. It does this by pairing a number with each vowel sound: one marks primary stress, two secondary stress, and zero an unstressed vowel. The word “sugars”, for example, is represented as ['SH', 'UH1', 'G', 'ER0', 'Z'], where the first vowel carries the primary stress and the second is unstressed. I randomly selected 10,000 words from the dictionary to serve as my training and test dataset. (Complete code is here.)
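For anyone following along, the data step looks roughly like this. The random seed and the 80/20 split below are my own choices rather than details from the actual code.

```python
import random

import nltk

# The CMU Pronouncing Dictionary ships with nltk as a corpus.
nltk.download("cmudict")
from nltk.corpus import cmudict

# cmudict.dict() maps each word to a list of pronunciations,
# e.g. "sugars" -> [['SH', 'UH1', 'G', 'ER0', 'Z']].
pronunciations = cmudict.dict()

# Keep the first pronunciation per word and sample 10,000 entries.
random.seed(42)
words = random.sample(sorted(pronunciations), 10_000)
pairs = [(w, pronunciations[w][0]) for w in words]

# Hold out a portion for testing.
split = int(0.8 * len(pairs))
train_pairs, test_pairs = pairs[:split], pairs[split:]
print(train_pairs[0])
```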
I created (with no small amount of help from ChatGPT) a Seq2Seq neural network. The Seq2Seq model is a specialized recurrent neural network architecture for converting one sequence into another. In this case, of course, it was designed to convert spelled-out words into their phonemes.
The model first turns each character of the input word into a vector embedding. The encoder then runs a bidirectional LSTM over these embeddings to produce a hidden representation of each character that takes into account the letters appearing both before and after it. Next, the decoder uses a unidirectional LSTM, initialized with the encoder’s final states, to model the relationship between the letter sequence and the phoneme sequence. The decoder output is concatenated with the output of an attention mechanism computed over the encoder and decoder hidden states, and all of this is fed into a dense layer with softmax activation to predict each phoneme.
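Here is a minimal Keras sketch of that architecture. The vocabulary sizes, embedding width, and hidden sizes are placeholders of my own rather than the values from the actual model.

```python
from tensorflow.keras import layers, Model

# Placeholder sizes -- assumptions, not the values from the original model.
NUM_CHARS = 30        # input characters, plus padding/start/stop tokens
NUM_PHONEMES = 75     # ARPAbet symbols, plus padding/start/stop tokens
EMBED_DIM = 64
HIDDEN_DIM = 128

# Encoder: embed each character, then read the word with a bidirectional LSTM.
enc_inputs = layers.Input(shape=(None,), name="chars")
enc_embed = layers.Embedding(NUM_CHARS, EMBED_DIM, mask_zero=True)(enc_inputs)
enc_outputs, fh, fc, bh, bc = layers.Bidirectional(
    layers.LSTM(HIDDEN_DIM, return_sequences=True, return_state=True)
)(enc_embed)
enc_state = [layers.Concatenate()([fh, bh]), layers.Concatenate()([fc, bc])]

# Decoder: a unidirectional LSTM initialized with the encoder's final states.
dec_inputs = layers.Input(shape=(None,), name="phonemes_in")
dec_embed = layers.Embedding(NUM_PHONEMES, EMBED_DIM, mask_zero=True)(dec_inputs)
dec_outputs, _, _ = layers.LSTM(
    2 * HIDDEN_DIM, return_sequences=True, return_state=True
)(dec_embed, initial_state=enc_state)

# Attention over the encoder states, concatenated with the decoder output.
context = layers.Attention()([dec_outputs, enc_outputs])
concat = layers.Concatenate()([dec_outputs, context])

# Dense softmax layer predicts a phoneme at each decoder step.
preds = layers.Dense(NUM_PHONEMES, activation="softmax")(concat)

model = Model([enc_inputs, dec_inputs], preds)
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
```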
I made two substantial mistakes in building this model. First, I failed to learn about start and stop characters. This resulted in sequences of phonemes that neglected the first letter of the input word and tended to repeat the final phoneme up to the maximum sequence length. Second, I had initially used accuracy as a metric, when I should have used sparse categorical accuracy. Switching required a bit of refactoring to change from one-hot encoded labels to integers, but it seemed to result in much better performance.
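To make the start/stop fix concrete, the decoder sequences can be wrapped like this before training; the token names and padding scheme here are hypothetical, not taken from the original code.

```python
# Hypothetical special tokens; the real project may name these differently.
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"

def prepare_decoder_sequences(phonemes, max_len):
    """Wrap a phoneme sequence in start/stop tokens and pad it.

    The decoder is fed [SOS, p1, p2, ...] and trained to predict
    [p1, p2, ..., EOS], so it learns where words begin and when to
    stop generating instead of repeating the final phoneme.
    """
    dec_in = [SOS] + list(phonemes)
    dec_out = list(phonemes) + [EOS]
    dec_in += [PAD] * (max_len - len(dec_in))
    dec_out += [PAD] * (max_len - len(dec_out))
    return dec_in, dec_out

print(prepare_decoder_sequences(['SH', 'UH1', 'G', 'ER0', 'Z'], max_len=8))
```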
The model performed surprisingly well, with overall accuracy of about 94%. As we might expect, it didn’t struggle with consonants. Looking over the output, it seemed like the most common mistakes were around whether a given “g” made a soft or hard sound, but even that was pretty rare. Instead, the vowels and the emphasis were where it made its most frequent errors.
I used Uberduck to create speech samples from a few notable examples of the ARPAbet output of the model. Uberduck directly accepts ARPAbet input, but has the quirk of mostly catering to fan communities. These examples nicely demonstrate some of the flaws that the model picked up.
Here’s Rainbow Dash from My Little Pony: Friendship is Magic pronouncing the word “secularism” in a somewhat unusual way:
And here’s Bob the Tomato with the model’s predicted pronunciation of the word “serious”:
If these words are recognizable, it’s because of their consonants. The emphasis is totally wrong in these words, though.
An alphabet of my own
Transformer models work by creating high-dimensional embeddings of textual units, whether those are phonemes, sub-word tokens, words, or sentences. The closer two units are in this space, the more likely they are to be substitutable for one another within a text. Thus, in an English-Spanish embedding, the words ‘dog’ and ‘perro’ would be very close to one another, just as in an English-only embedding, the words ‘dog’ and ‘puppy’ would be expected to be close.
In a character- or phoneme-level embedding, this principle still holds. Phonemes that are nearby in the embedding space are likely to appear in similar contexts in words — that is, to be surrounded by similar phonemes. This might help us to achieve a more concise orthography if it allows us to combine sounds into a single letter that can be distinguished by context. For example, if ‘b’ and ‘m’ were both represented by the character ☃, then we’d spell both ‘bath’ and ‘math’ as ‘☃ath’. From context, it should be easy to determine which meaning is intended. This compression of sounds that neighbor each other in our embedding would, in turn, free up space in our alphabet for phonemes that really should stand alone, as well as combinations of phonemes that are high-frequency enough to merit their own characters.
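To make the “neighboring phonemes” idea concrete, here is a small sketch of how one might look up the phonemes closest to ‘B’ in a trained embedding matrix. The embedding matrix and vocabulary below are random stand-ins for whatever a trained model would actually produce.

```python
import numpy as np

def nearest_phonemes(target, embeddings, phoneme_ids, k=5):
    """Return the k phonemes whose embeddings have the highest cosine
    similarity to the target phoneme's embedding."""
    vecs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = vecs @ vecs[phoneme_ids[target]]
    id_to_phoneme = {i: p for p, i in phoneme_ids.items()}
    ranked = [id_to_phoneme[i] for i in np.argsort(-sims)]
    return [p for p in ranked if p != target][:k]

# Stand-in data: a random embedding matrix over a toy phoneme vocabulary.
rng = np.random.default_rng(0)
phoneme_ids = {p: i for i, p in enumerate(["B", "M", "P", "AA", "IY", "Z"])}
embeddings = rng.normal(size=(len(phoneme_ids), 16))
print(nearest_phonemes("B", embeddings, phoneme_ids, k=3))
```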
To create these embeddings, I used the same 10,000 pairs of words and phonemes from the CMU Pronouncing Dictionary. I first separated the stress numbers from the vowel phonemes and dropped the zeros entirely, leaving the unstressed phonemes without a number. Then I tokenized both the Latin-alphabet spellings and the set of phonemes using unigram tokenization, which produces a vocabulary containing all the individual letters and phonemes plus the highest-frequency combinations of them. This is important not only because it has the potential to improve prediction and create better embeddings, but also because it gets me closer to my goal of having common sound combinations represented by a single character. For this exercise, I somewhat arbitrarily chose a vocabulary size of 100 for the spellings and 85 for the phonemes.
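One way to sketch this preprocessing and tokenization step uses the Hugging Face tokenizers library; the split_stress helper is my own illustration of the stress-number handling described above, and the choice of library is an assumption rather than what the project actually used.

```python
import re

from nltk.corpus import cmudict
from tokenizers import Tokenizer, models, trainers

def split_stress(phone):
    """'AA1' -> ['AA', '1']; 'AA0' -> ['AA']; 'K' -> ['K']."""
    base, stress = re.match(r"([A-Z]+)([012])?$", phone).groups()
    return [base] if stress in (None, "0") else [base, stress]

def train_unigram_tokenizer(sequences, vocab_size):
    """Train a unigram tokenizer over an iterable of strings."""
    tokenizer = Tokenizer(models.Unigram())
    trainer = trainers.UnigramTrainer(vocab_size=vocab_size)
    tokenizer.train_from_iterator(sequences, trainer=trainer)
    return tokenizer

# Build the two corpora: plain spellings, and space-joined phoneme strings
# with the stress digits split off and the zeros dropped.
entries = list(cmudict.dict().items())[:10_000]
spellings = [word for word, _ in entries]
phonemes = [" ".join(t for p in prons[0] for t in split_stress(p))
            for _, prons in entries]

spelling_tok = train_unigram_tokenizer(spellings, vocab_size=100)
phoneme_tok = train_unigram_tokenizer(phonemes, vocab_size=85)
print(spelling_tok.encode("sandwich").tokens)
```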
After training this transformer model to predict phonemes from Latin spellings, I created a two-dimensional representation of the phonemes using t-distributed stochastic neighbor embedding (t-SNE). From this reduced-dimensionality representation of the embeddings, I used k-nearest neighbors to create 40 clusters of phonemes, and assigned each cluster a unique, human-readable character.
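The reduction-and-clustering step can be sketched with scikit-learn. Since the goal is a fixed number of groups, I’ve used KMeans here as a stand-in for the nearest-neighbor grouping described above, and the embedding matrix and glyphs are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Stand-in embedding matrix: one row per phoneme token from the transformer.
rng = np.random.default_rng(0)
phoneme_tokens = [f"PH{i}" for i in range(85)]   # placeholder token names
embeddings = rng.normal(size=(len(phoneme_tokens), 64))

# Project the embeddings down to two dimensions with t-SNE...
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)

# ...then carve the 2-D space into 40 clusters, one per symbol in the new alphabet.
labels = KMeans(n_clusters=40, n_init=10, random_state=0).fit_predict(coords)

# Assign each cluster an arbitrary human-readable character (placeholder glyphs).
glyphs = [chr(0x2600 + i) for i in range(40)]
symbol_for = {tok: glyphs[label] for tok, label in zip(phoneme_tokens, labels)}
print(symbol_for["PH0"])
```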
The transformer model only achieved 77% accuracy on the task of predicting phonemes from spelling, but since I primarily was interested in obtaining the phonemes’ embeddings, this didn’t seem too egregious.
A few things stuck out to me on first reviewing the list of symbols and phonemes. First, 16 of the 40 symbols were assigned to a single phoneme each. This made it convenient for me to retrospectively reassign members of the Latin alphabet to some of these symbols. There was also a unique character for one of the vowel stress levels on its own, the one indicated by the number 2 in the ARPAbet. That said, a lot of symbols were assigned to phonemes that included stress information. Finally, certain symbols had more than their fair share of phonemes. A single symbol was assigned to a total of 7 phonemes — 'AA1-N', 'AO1', 'AW1', 'B', 'OW1', 'T', and 'V'. It was interesting that all the vowel sounds in that list carried the same stress marking, but it was hard not to see this as an ominous sign for readability.
You can view the complete list of symbol / phoneme combinations here.
Sample texts
To demonstrate this novel spelling system, I’ve used a few lines from the standard text from the George Mason University Speech Accent Archive, which includes all the sounds of English. Vowel phonemes without numbers indicate unaccented sounds in the following, and hyphens indicate representation in a single character — for example, ☾ indicates the ‘IH0-K’ sound in ‘manic’. For ease of reading (sic), I’ve put a • between each word in the phonetic versions of each sentence.
Please call Stella.
ⱬ╪⁑❑ KⴼL !⸔A
P L-IY 1 Z • K AO1 L • S-T EH1-L AH
Ask her to bring these things with her from the store:
◬<K #⁑ ⴼ⫫ ⴼ#⬗┳ ð⬗❑ $⬗┳❑ W⬗$ #⁑ F#AM ð$ !#
AE1 S K • HH ER • T UW1 • B R IH1 NG • DH IH1 Z • TH IH1 NG Z • W IH1 TH • HH ER • F R AH1 M • DH AH • S-T AO1-R
Six spoons of fresh snow peas,
<⬗K< <ⱬ⫫N❑ A⁑ⴼ F#◬⏃ <Nⴼ ⱬ@❑
S IH1 K S • S P UW1 N Z • AH 1 V • F R EH1 SH • S N OW1 • P IY1 Z
five thick slabs of blue cheese,
F⟕ⴼ $⬗K <L◬ⴼ❑ A⁑ⴼ ⴼL⫫ ⶻ@❑
F AY1 V • TH IH1 K • S L AE1 B Z • AH1 V • B L UW1 • CH IY1 Z
and maybe a snack for her brother, Bob.
⸔⸔ Mℤⴼ⟕ A <N◬K F# #⁑ ⴼ#A⁑ð⁑ ⴼ⁑ⴼ
AH-N D • M EY1 B IY • AH • S N AE1 K • F AO1-R • HH ER0 • B R AH1 DH ER • B AA1 B
Breaking every bone in my body against Chesterton’s fence
Hooooooo boy.
Let’s start out with the good stuff. This novel alphabet lets us spell words with 33% fewer characters than the Latin alphabet. There are also some startlingly clever spellings in the above — for example, ‘!#’ for store or ‘!⸔A’ for Stella. This suggests to me that there might be value in combining frequent sound combinations into single characters.
On the other hand, I feel like I stared lexical madness in the face and lost. Two of my assumptions in particular now seem very, very wrong — namely, my interest in including information on emphasis and my idea that a single symbol can stand for consonants, vowels, and combinations of the two.
Embedding information on three levels of vowel stress created a need for more letters than were really practical. In effect, it took our nineteen vowel sounds and made 57 out of them. Take the phoneme ‘AA’, which corresponds to the vowel sound in the word ‘balm’. There are three separate standalone representations of ‘AA’ — A, ⁑, and ↂ (in order of increasing emphasis). There are also two more characters representing combinations with the ‘AA’ phoneme, but these can only be used with one particular stress level: ◬ for ‘AA1-R’ and ⴼ for ‘AA1-N’. Combining the other stress levels with either the ‘R’ or ‘N’ sound requires two symbols, though. This sort of irregularity is easily perceived as hostile.
There are several cases in which the need for emphasis adds an additional character to a word as well. Think of the word ‘of’, which would be spelled A⁑ⴼ in my system — the middle character just indicates the degree of emphasis. I could imagine working with this to some extent, perhaps by dropping standalone accent marks in monosyllabic words, but that would mean adding more rules to what I had hoped would be a fairly simple system.
The character ⴼ provides a good demonstration of how messy my system ended up. It has seven different phonemes and phoneme combinations assigned to it. The word ‘ⴼⴼⴼ’ can be pronounced as either ‘vote’ or ‘bought’ (or ‘taught’ or ‘boat’, etc.), depending on the context. If I were to repeat this exercise, I’d restrict clusters to only vowel or consonant combinations, but not both. It’s just too much to keep track of otherwise.
If there’s something that I ended up appreciating about English spelling from this exercise, it’s the interactions between letters. We Anglophones have largely internalized rules like how an ‘e’ at the end of a word changes a vowel sound from short to long, or how we can tell the sound that the letter ‘c’ makes by the vowel after it. These interactions between letters are a pretty clever means of compressing phonetic information, and I think my system could have benefited from some mechanism by which the surrounding letters indicate which sound a given character is meant to represent. In other words, I probably underestimated the cognitive burden of using semantic context alone to determine which sound a symbol stands for.
I want to wrap up by talking a little bit about what I tried that didn’t work for me and where I could see a neural network approach to spelling reform producing a reasonable system.
The first thing that I tried for this project was an autoencoder, which is a neural network defined by having the same input and output. The encoder of this architecture creates a smaller representation of the input, which is then expanded back out by the decoder. The idea behind this design is that the compression forces the network to learn the relevant characteristics of the input. My idea was that I would take our sequences of phonemes, with 71 possible values, and process them in an autoencoder with a bottleneck of size 40. I had imagined that I would be able to extract the representations in the bottleneck and encode them to somehow correspond with the 40 symbols in my novel alphabet. I hadn’t accurately imagined what these representations would look like, though. All were 40 places long, and many of those places were occupied by zeros. This would have worked if I had been content to have words that were all 40 letters long, but that would have been even worse than what I ended up with.
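For reference, the shape of autoencoder I have in mind is roughly the following Keras sketch. The 71 phoneme values and the 40-unit bottleneck come from the description above; the sequence length and layer sizes are placeholders.

```python
from tensorflow.keras import layers, Model

NUM_PHONEME_IDS = 72   # 71 possible phoneme values plus a padding index
MAX_LEN = 16           # maximum sequence length -- an assumption
BOTTLENECK = 40        # size of the compressed representation

# Encoder: embed the phoneme IDs, read them with an LSTM, squeeze to 40 numbers.
inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(NUM_PHONEME_IDS, 32, mask_zero=True)(inputs)
x = layers.LSTM(64)(x)
bottleneck = layers.Dense(BOTTLENECK, activation="relu", name="bottleneck")(x)

# Decoder: expand the 40-number code back out into a phoneme sequence.
x = layers.RepeatVector(MAX_LEN)(bottleneck)
x = layers.LSTM(64, return_sequences=True)(x)
outputs = layers.Dense(NUM_PHONEME_IDS, activation="softmax")(x)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# The "spelling" I had hoped to extract is just the bottleneck activations.
encoder = Model(inputs, bottleneck)
```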
Next, I tried a variational autoencoder (VAE). This is similar to an autoencoder, but the one I created stored the information in a 40-dimensional Gaussian distribution. VAEs are generative neural networks and can create permutations on input data, such as celebrity faces in this example. I simply couldn’t figure out how to work with them, though. I had imagined that it would be a little like working with a diffusion model, where the seed affects the output. But, whether that supposition was correct or not, I just wasn’t up to the task of working with them yet.
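The piece of the VAE that differs from a plain autoencoder is the latent layer, which can be sketched like this; it’s a generic Keras example rather than my actual code, and the 64-unit input stands in for whatever encoder network precedes it.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

LATENT_DIM = 40  # the 40-dimensional Gaussian latent described above

class Sampling(layers.Layer):
    """Reparameterization trick: draw z from N(mean, exp(log_var))."""
    def call(self, inputs):
        z_mean, z_log_var = inputs
        epsilon = tf.random.normal(shape=tf.shape(z_mean))
        return z_mean + tf.exp(0.5 * z_log_var) * epsilon

# Instead of a single bottleneck vector, the VAE encoder predicts the
# parameters of a Gaussian over the latent space and samples from it.
features = layers.Input(shape=(64,))
z_mean = layers.Dense(LATENT_DIM, name="z_mean")(features)
z_log_var = layers.Dense(LATENT_DIM, name="z_log_var")(features)
z = Sampling()([z_mean, z_log_var])
vae_encoder = Model(features, [z_mean, z_log_var, z])

# During training, a KL-divergence penalty between N(z_mean, exp(z_log_var))
# and a standard normal gets added to the reconstruction loss; that penalty
# is what keeps the latent space smooth enough to sample new outputs from.
```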
Finally, just before resorting to the transformer model I ended up with, I tried a generative adversarial network (GAN). My idea was that I would provide the generator with a sequence of phonemes and a set of available symbols. I’d imagined that the generator could produce a map of correspondences between phonemes and symbols that could be used to produce spellings. With each step of the model, I’d train a very simple recurrent neural network to act as the discriminator, to see how well it could predict phonemes from spellings. It was pretty early in my exploration of this method that I realized it would take a very long time to run, and so I gave up.
I see now that the GAN would have run into the same problems as my transformer model did, though: my fixation on vowel emphasis and neglect of interaction between symbols would have likely led to something not too different from what I ended up with. That said, I think if I had the capacity, inclination, and processing power to produce a less rule-bound GAN, that would be my best bet for creating a neural network that could create a usable English orthography.
Conservatives like to say that we shouldn’t get rid of a tradition or institution until we understand why it was put in place. I think this is a pretty poor argument, and it puts the onus entirely on people who have ideas for improvement. The past wasn’t a great place for most people, and a lot of those traditions are just there to consolidate someone’s power or wealth. English picked up its orthography as a legacy of the Roman empire, and that orthography has now spread around the world on the back of the dominance of English itself. I think it’s an important gesture, at least, to think about how an institution such as the English language could be revised to better suit its current role as an international lingua franca. This little exercise of mine ended up showing me that parts of English spelling that I’d thought of as flaws are better thought of as necessary compromises. The question is always ‘Compared to what?’, though, and in this case, conventional English spelling came out the unquestioned victor. That said, I’m not convinced that no better system is possible.