The Unfolding of Language
I just finished 'The Unfolding of Language'. It is not, and does not intend to be, a textbook on natural language processing. And yet, NLP researchers like Yuval Feinstein have found it intriguing. The book is all about how languages evolve and morph, and what binds them even in all their diversity. In a world where capturing meaning from unstructured data is becoming increasingly important, it is a useful read on what makes languages (and therefore language processing) 'tick'. It is a bit like reading a book on the evolution of physics, then looking under the car hood and saying, 'Wait a minute, I think I know what's going on here!'
'Unfolding' makes several points across its chapters, and many of them carry takeaways for, or at least resonate with, the way we parse and interpret languages. Mr. Deutscher (a PhD in Math turned linguist) has experience spanning some twenty languages, so there is likely truth in what he asserts.
On to the book.
Deutscher draws a distinction between 'content words', words that carry meaning in themselves, such as 'kick' or 'rabbit', and 'grammar words' such as 'the', 'which' and 'than' that acquire meaning from the context in which they are used. While it is useful to draw meaning from content words, ignoring grammar words carries risk - try parsing 'Lawyers Give Poor Free Advice'. In the absence of a 'the', the meaning can change. As someone who knows Bengali, I was also struck by the article 'the' (always before the noun). The Bengali appendage 'ta' does seem to be a kind of 'the', but placed after the noun (once again, culture determining word order, just like the French adjective coming after the noun). It does not occur after verbs, for example. So we could call 'ta' a grammar word, indeed an article. (That last bit is my own addition.)
Takeaway: Grammar words need to be attached to the correct content words or correct groups of content words.
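As a quick aside (purely my own sketch, nothing from the book), here is how the headline looks to a stock part-of-speech tagger in Python with NLTK. Treat it as an illustrative toy; the NLTK data package names can vary by version.

```python
# Tag the ambiguous headline with and without the grammar word 'the',
# using NLTK's off-the-shelf tokenizer and tagger (my own illustration).
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["Lawyers give poor free advice.",
                 "Lawyers give the poor free advice."]:
    tokens = nltk.word_tokenize(sentence)
    print(nltk.pos_tag(tokens))

# A flat tag sequence alone will not settle the ambiguity: the real work is
# attaching 'poor' either to 'advice' (adjective reading) or to 'the'
# (noun-phrase reading, 'the poor') - which is exactly the takeaway above.
```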
Deutscher also takes us through the language tree - how languages descend from common root languages, sometimes going back 5,000 or 10,000 years (here's one). So if we are good at parsing a language, we should be reasonably good at parsing a neighboring language (sounds like a technique?). Semitic languages, for example, build words on three-letter consonant roots (e.g. l-b-s for clothing). Variations include libaas (attire) or labisa (he wore). The number of meanings that vowel combinations can produce is huge. I am sure this is already being exploited - keep a lookup of the consonant roots and use the vowel pattern to work out what is being said.
Takeaway: In certain groups of languages (Latin, Arabic, etc.), the consonant forms of single words are more useful than they are in other languages.
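To make the 'lookup of consonant roots' idea concrete, here is a toy sketch of my own (the root list and glosses are just illustrative examples at the transliteration level; real Arabic morphological analysis handles weak radicals, gemination and much more):

```python
# Toy consonant-root lookup: strip vowels from a transliterated word to
# expose the triliteral root, then look the root up (my own illustration).
ROOTS = {
    "lbs": "to wear / clothing",   # e.g. libaas (attire), labisa (he wore)
    "ktb": "to write / book",      # e.g. kitaab (book), kataba (he wrote)
}

def consonant_skeleton(word: str, vowels: str = "aeiou") -> str:
    """Drop Latin-transliterated vowels, keeping the consonant skeleton."""
    return "".join(ch for ch in word.lower() if ch not in vowels)

for word in ["libaas", "labisa", "kitaab"]:
    root = consonant_skeleton(word)
    print(word, "->", root, "->", ROOTS.get(root, "unknown root"))
```

The vowel pattern that was stripped out is what would then tell you which variation (attire vs. he wore) is being used.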
The three forces that shape language are economy (sounds that require less effort to speak are preferred, or words are joined to create new words), expressiveness (adding words to introduce emphasis - for example, 'not at all' instead of a simple 'no') and analogy (the tendency to keep plural forms or word forms similar, for example). (Excessive analogy can be dangerous - a child may call a three-pronged fork a 'threek' because he may have seen a four-pronged fork the first time.) The impact of this on NLP is beyond me at first read, but perhaps I will think of something.
Metaphor also plays a role. Many words start out 'physical' in nature and only later become abstract. So people 'devour' Sherlock Holmes stories when, in reality, they are not eating anything. Metaphor seems to be an interesting prelude to words being used in new contexts. For that matter, when we say 'mammoth deficit' we don't mean a shortage of woolly mammoths, but a HUGE deficit.
Takeaway: Metaphor is a powerful tool for abstraction. The sense of a word has to follow from the other words in the sentence.
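For the 'sense follows from the other words' point, here is a quick sketch of my own using the classic Lesk algorithm as packaged in NLTK, disambiguating 'devoured' against WordNet senses (the book mentions neither WordNet nor Lesk; which sense it actually picks may vary):

```python
# Lesk word-sense disambiguation: pick the WordNet sense of 'devoured'
# whose dictionary gloss overlaps most with the surrounding words.
import nltk
from nltk.corpus import wordnet as wn
from nltk.wsd import lesk

nltk.download("wordnet", quiet=True)

sentence = "She devoured the Sherlock Holmes stories over the weekend".split()
sense = lesk(sentence, "devoured", pos=wn.VERB)
print(sense, "-", sense.definition() if sense else "no matching sense")

# Whichever sense it returns, the choice is driven entirely by the context
# words ('stories', 'weekend', ...) - crude, but it is the takeaway above.
```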
There is also the analogy between space and time - humans everywhere seem to have discovered the space-time continuum long before Einstein and modelled their language constructs on the idea. So we speak of 'I am going to' as an action in the future when, actually, we are not going anywhere physically. This cuts across cultures and has implications for the way we deal with prepositions (Hindi: 'main jaanewala hoon'). Or consider the equivalence in prepositions - 'around the fire', 'around lunch time' - one is spatial, the other temporal. Metaphors also borrow from body parts - in many languages, 'in the middle of' uses the same verbiage as 'in the belly of'.
Takeaway: Language is often shaped by our perception of order (in space and time) and by our physiology, and these are (not surprisingly) culture-independent.
Let me come back to the last chapter. There, Deutscher starts a story with the simplest of 'Me Tarzan, You Jane' constructs and then builds on it to get a 'complex' story, showing how certain word types (POS) are foundational and the others can (and probably did) come later. The sequence seems to be (noun-verb) -> (earlier types + pronouns and prepositions) -> (earlier types + adjectives, adverbs, possessives and quantifiers) -> (earlier types + subordination). Indeed, subordination (the use of 'that') seems to be a powerful way to build complex sentences. (The headache, then, is: which 'that' is that? :))
Takeaway: Get the key 'word-actors' (words or groups of words, such as noun phrases and verb phrases) first. The rest follows from there.
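Here is one small sketch of 'get the key word-actors first', again my own choice of illustration (regular-expression noun-phrase chunking with NLTK in Python; LingPipe, mentioned below, offers similar chunking in Java):

```python
# Chunk noun phrases out of a tagged sentence with a simple regex grammar.
import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "The angry hunter chased the rabbit that stole his dinner"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Determiner + adjectives + noun(s) become one NP chunk; verbs, the
# subordinator 'that', and the rest are left for later stages.
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
print(nltk.RegexpParser(grammar).parse(tagged))
```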
I said before - it's not a textbook. You won't find LingPipe (my package of choice for many of these things) here, or TF-IDF. But if you are considering the study of Natural Language Processing, I strongly recommend 'The Unfolding of Language'. You will come away with the feeling that languages are more similar than you thought, and that, if you know what differences to look for, your journey will be easier. You have to give it to this Guy.
(Comments and critique welcome. After all, there are many things I still don’t know.)