How to improve code readability with language knowledge

Technical articles usually introduce coding practices from an engineering perspective. But beyond computer science, there are also numerous links between programming languages and natural languages. Below, we analyze how high-school-level Chinese and English relate to code readability. If you’re tired of fancy new technology concepts, maybe this back-to-basics article will give you some inspiration 🤔

Programming languages and the college entrance examination (gaokao) essay

Code written by masters is meant less for computers than for people. True master programmers write code that is clear and understandable, and they document it carefully.

— Code Book

Can we judge a piece of code the way we judge whether an article is easy to read? Further, do programming language code and natural language text share anything non-technical? Let’s take the gaokao essay, which is as rigid as code, for comparison. It’s not hard to find some interesting similarities:

It is hard for code to correctly anticipate changes in requirements, and it is hard for a gaokao essay to extract the intended theme from the prompt.
Code needs a sound architectural design before it is written, and a gaokao essay needs an outline before it is written.
Code should be well modularized and componentized, and a gaokao essay needs sensible paragraphs that connect tightly.
The fewer warnings in the code, the better; the fewer errors in a gaokao essay, the better.
Code requires correct formatting and indentation, and a gaokao essay requires neat handwriting.
Code should involve as little copy and paste as possible, and plagiarism is strictly prohibited in a gaokao essay.
Few people can write good code, and even fewer can write a good gaokao essay.
…

Quite a few similarities, aren’t there? However, the essay genres of the gaokao, such as narrative, argumentative and lyrical writing, are already high-level abstractions of human thinking. Lyrical writing in particular involves feelings, so its content and ideas are hard to compare with logical program code. Moreover, the essay is written in Chinese, not the English that mainstream programming languages are based on, so an analysis from the perspective of the Chinese essay may be too broad to be apt. Therefore, we’ll look at the relationship between code and language from an English perspective.

Parts of speech and keywords

Both Chinese and English have the concept of parts of speech: words can be classified as nouns, verbs, adjectives, conjunctions, pronouns and so on. Computer languages have built-in keywords such as for / if / else. So how do the parts of speech of these keywords relate to the nature of a computer language?

In fact, the parts of speech chosen for keywords vary greatly between computer languages. Note that programming languages are only a subset of computer languages. Take the classic front-end trio of HTML + CSS + JavaScript:

HTML is a markup language.
CSS is a style language.
JavaScript is a programming language.

All three are computer languages, but how do the parts of speech of their keywords differ?

When it comes to HTML, the first things that come to mind are tags such as <head> / <body> / <img> / <table>. The names of these tags are nouns.
When it comes to CSS, typical code looks like .xxx { background: black; }, which is a rule. The rule names here, such as background / position / width / color, are basically nouns, and the rule values are commonly adjectives.
As for JavaScript, apart from the three nouns function / var / class, its control flow logic is almost always governed by function words such as if / else / for / while. There are also plenty of verbs such as return / break / new.

We can see that in markup and style languages, keywords consist almost entirely of content words, with no need for function words. What can content words express? They can express what a thing is. So with HTML and CSS, you tell the machine what you want, not how to get there. For example, you can tell the HTML parser that there is an <img> image without worrying about the image format, how it is loaded, and so on. You tell the CSS engine to render the title in red, but you don’t care how the layout is computed, how the GPU renders it, and so on. Thus, in computer science we call HTML and CSS declarative languages, and one distinguishing feature of these languages is that their keywords are almost all content words.

Declarative languages are generally easy to understand (by analogy, would anyone call HTML and CSS the hardest code to maintain?). The same obviously cannot be said of JavaScript, the remaining member of the trio. To understand how its keywords relate to its role as a programming language, we need a more detailed classification of its common keywords:

Nouns: function, var, class
Verbs: import, export, extends, return, break, continue, delete, switch, new, try, catch, throw, yield
Prepositions: for, in, else
Conjunctions: if, while
Pronouns: this

Think of how a programming language is used day to day: tell the browser to request an API first, format the data in some way when it arrives, send a new copy of the data if the user clicks OK, then see what comes back… Code like this describes how a problem is solved, not what the problem is. Therefore, a programming language needs many function words to express the logical relations between statements through branches, loops and so on. This style of code is called procedural programming.
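As a minimal sketch of that procedural style, the function below spells out how a result is built step by step. The function name, the data and the threshold are hypothetical, made up for illustration; the point is only how for / if / else carry the logic while return and continue act as verbs.

```javascript
// Hypothetical example: collect the scores that reach a threshold.
function collectPassingScores(scores, threshold) {
  const passing = [];
  for (const score of scores) {   // function word: drives the loop
    if (score >= threshold) {     // function word: opens a branch
      passing.push(score);
    } else {                      // function word: the other branch
      continue;                   // verb: skip to the next score
    }
  }
  return passing;                 // verb: hand the result back
}

// Logs the three scores that are at or above 60.
console.log(collectPassingScores([50, 80, 65, 90], 60));
```

Nothing in the function says what a passing score is in the abstract; it only says how to walk the list and what to do at each step.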

Besides the many function words that express control flow, programming languages are characterized by a large number of verb keywords. If function / var / class let us define basic data concepts, the verb keywords give us the ability to manipulate those concepts. For example, we use import / export to work with the conceptual model of modules, try / catch / throw to work with the conceptual model of exceptions, and new / extends to work with the conceptual model of classes… So procedural programming languages need many verbs to express the manipulation of data.
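The sketch below uses a few of these verb keywords together. The Shape and Square classes and the safeArea helper are hypothetical names invented for this example, not part of any library.

```javascript
// Hypothetical classes used only to exercise the verb keywords.
class Shape {
  constructor(name) {
    this.name = name;
  }
  area() {
    // throw: raise an exception for shapes that cannot compute an area
    throw new Error(`${this.name} has no area`);
  }
}

// extends: build a new class on top of an existing one
class Square extends Shape {
  constructor(side) {
    super('square');
    this.side = side;
  }
  area() {
    return this.side * this.side;
  }
}

// try / catch: turn a thrown exception into a null result
function safeArea(shape) {
  try {
    return shape.area();
  } catch (error) {
    return null;
  }
}

console.log(safeArea(new Square(3)));     // 9
console.log(safeArea(new Shape('blob'))); // null
```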

Verbs show up not only in keywords but also in everyday coding practice. For example, the print statement in Python 2 became the print function in Python 3. Doesn’t that suggest that everyday functions and language keywords are, to some degree, interchangeable? So when we write code that processes data, that code should also be wrappable in a function named with a verb. Of course, real-world functions tend to be highly customized and far more specific than the concise keywords of a programming language, so a function name is often more than a single verb. In that case, a verb-object phrase such as getElementById works well as a function name.
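To make the naming advice concrete, here is a small sketch in which every function name is a verb-object phrase. The user list and the function names are assumptions made up for illustration.

```javascript
// Hypothetical in-memory data.
const users = [
  { id: 1, name: 'Ada', active: true },
  { id: 2, name: 'Linus', active: false },
];

// verb (find) + object (user) + qualifier (by id)
function findUserById(list, id) {
  return list.find((user) => user.id === id) || null;
}

// verb (count) + object (active users)
function countActiveUsers(list) {
  return list.filter((user) => user.active).length;
}

// verb (format) + object (user label)
function formatUserLabel(user) {
  return `#${user.id} ${user.name}`;
}

console.log(formatUserLabel(findUserById(users, 2))); // "#2 Linus"
console.log(countActiveUsers(users));                 // 1
```

Read aloud, each call is close to an English imperative: find the user by id, then format the user label.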

In modern programming languages, nouns correspond to variables and classes, verbs to functions and methods, and prepositions and conjunctions to control flow. Beyond these there is one very special category: the pronoun, which corresponds to this. What role do pronouns play in programming languages? In natural languages, a pronoun refers to a concept mentioned earlier in the context, while this is a reference to something in the current context. Aren’t the two conceptually close?

Unfortunately, judged by the natural language analogy, the original design of this in JavaScript fails badly. In the early days of front-end development, this often did not point to what the surrounding code would naturally suggest; instead, a set of odd rules made it point to different contexts. The community has since made improvements to address this flaw in the language mechanism, so that code using this is easier to write and read. This can also be seen as readability influencing programming language design.
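Here is a minimal sketch of how this depends on the way a function is called rather than on where it is written. The objects are hypothetical. Function.prototype.bind and arrow functions are two standard language features that pin this down; whether they are the improvements the paragraph above has in mind is an assumption.

```javascript
const school = {
  name: 'school',
  describe() {
    return `this is the ${this.name}`;
  },
};

// The same function attached to another object: `this` follows the caller.
const library = { name: 'library', describe: school.describe };
console.log(library.describe()); // "this is the library"

// bind fixes `this` no matter how the function is called later.
const describeSchool = school.describe.bind(school);
console.log(describeSchool()); // "this is the school"

// An arrow function has no `this` of its own; it uses the enclosing one.
const classroom = {
  name: 'classroom',
  makeDescriber() {
    return () => `this is the ${this.name}`;
  },
};
const describeClassroom = classroom.makeDescriber();
console.log(describeClassroom()); // "this is the classroom"
```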

Sentence Patterns and Expressions

In natural language, we can arrange words into sentences, and sentences have different structures, such as declarative sentences, imperative sentences, interrogative sentences, exclamatory sentences and so on.

By analogy, the concepts of statement and expression usually come up in beginner courses on a programming language. For example, if (color === 'RYB') fxck(); as a whole is a statement, in which color === 'RYB' is an expression.

How do statements and expressions in programming languages relate to sentences in natural language? Imperative, interrogative and exclamatory sentences all carry feelings that are not relevant to our topic, so let’s explore the simplest declarative sentences. We pick the two most representative structures, subject-verb-object and subject-linking verb-predicative:

S + V + O subject-verb-object

Children go to school.

This is a very easy to understand subject-verb-object structure. This structure is also very easy to map to code in a programming language:

child.go(school);

The subject corresponds to a data model, the predicate to a method, and the object to an argument. Clear and easy to read, isn’t it?
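Fleshed out into something runnable, the mapping could look like the sketch below. The Child and Place classes and their fields are hypothetical; the text above only gives the single line child.go(school).

```javascript
class Place {
  constructor(name) {
    this.name = name;
  }
}

class Child {
  constructor(name) {
    this.name = name;
    this.location = null;
  }
  // predicate: the verb becomes a method, the object becomes its argument
  go(place) {
    this.location = place;
    return `${this.name} goes to ${place.name}`;
  }
}

const child = new Child('The child');   // subject
const school = new Place('school');     // object
console.log(child.go(school));          // "The child goes to school"
```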

S + V + P subject-linking verb-predicative

The school is black.

The subject + linking verb + predicative structure is also very easy to read. But there is a big catch: natural languages don’t distinguish between statements and expressions, so the sentence above can be understood as either of the following:

school = 'black'; is an assignment statement. We write it for its side effect, although in JavaScript the assignment itself still evaluates to the assigned value.
school === 'black' is a comparison expression that evaluates to true or false.

This creates a real ambiguity: when translated into code, does the sentence mean assigning 'black' to school, or checking whether school is 'black'? In control flow, such ambiguity can cause problems:

if (school = 'black') fxck()
if (school === 'black') fxck()

A sentence like “if the school is black, do XXX” can therefore go wrong when written as code. In the code above, the former does XXX regardless of whether the school is black, because the assignment evaluates to the truthy string 'black'; the latter is the reasonable implementation.
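The difference can be checked directly. In this sketch the two helper functions are hypothetical wrappers around the two conditions above; the first one contains the assignment bug on purpose.

```javascript
// Bug on purpose: `=` assigns 'black', and the condition sees that truthy string.
function isBlackUsingAssignment(color) {
  let school = color;
  if (school = 'black') {
    return true;
  }
  return false;
}

// `===` compares and yields true or false.
function isBlackUsingComparison(color) {
  const school = color;
  return school === 'black';
}

console.log(isBlackUsingAssignment('white')); // true, whatever the color was
console.log(isBlackUsingComparison('white')); // false
console.log(isBlackUsingComparison('black')); // true
```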

From this example we can see that, when translating readable natural language into code logic, simple declarative sentences can correspond to statements in a programming language, while the clauses of complex sentences that express logic are closer to expressions. One of the challenges of programming is untangling the fuzzy logic of natural language, which takes learning and practice to do well.

Tense and synchronous/asynchronous code

Words form sentences, and English sentences have tense. Similarly, the state of data is a very important concept while a program is running. Here, too, we can draw a useful analogy.

Synchronous and asynchronous operations are both very common in real-world applications. For example, when a user clicks OK on a page to submit data to the back end, the network request and response take time, so the result of the request arrives asynchronously. So which natural language tenses are analogous to synchronous and asynchronous code?

For synchronous code, you can name variables and functions in the simple present tense, and the flow will be clear. But with asynchronous code, you might find that a variable has no value yet when you access it. So how do you deal with that?

Promise objects are a great tool for handling asynchronous logic. A Promise has three states, pending, resolved (fulfilled, in the specification’s terms) and rejected, and the resolve and reject functions move it from pending to one of the other two. The operations are again named with verbs, but notice that the states are named with a present participle and past participles, the forms used in the present continuous and present perfect tenses. More generally, we can abstract rules like this:

State variables that express something ongoing are named in the present continuous tense.
State variables that express something completed are named in the present perfect tense.

In this way, we can smoothly carry the way natural language thinks about tense over into code that expresses asynchronous state, making the code more readable.
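As a sketch of these two rules, the helper below tracks a Promise with state variables named by tense. trackRequest and the flag names isLoading, hasLoaded and hasFailed are hypothetical, chosen for illustration.

```javascript
// Wrap any promise and expose its progress through tense-named flags.
function trackRequest(promise) {
  const state = {
    isLoading: true,   // present continuous: still in progress
    hasLoaded: false,  // present perfect: completed successfully
    hasFailed: false,  // present perfect: completed with an error
  };
  const settled = promise.then(
    (value) => {
      state.isLoading = false;
      state.hasLoaded = true;
      return value;
    },
    () => {
      state.isLoading = false;
      state.hasFailed = true;
    }
  );
  return { state, settled };
}

const request = trackRequest(Promise.resolve({ ok: true }));
console.log(request.state.isLoading); // true: the callbacks have not run yet
request.settled.then(() => {
  console.log(request.state.hasLoaded); // true
});
```

A reader can tell from the names alone which flags describe work in progress and which describe work that is done.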

Grammar and compiler

Most of the above is just grammar, but the study of language belongs to linguistics, a discipline that goes far beyond high school grammar. It is safe to say that linguistics, though a humanities field, has had a significant impact on both the design and the implementation of programming languages. What’s the basis for that claim? Let’s start with one branch of linguistics.

That branch is syntax, the same Syntax as in SyntaxError, a common compiler error. The study of syntax involves analyzing sentence structure. Early researchers proposed two methods of analysis, the binary cutting method and the bracketing method. Take the following sentence:

The teacher abuses a child.

By first cutting the whole sentence in two, then splitting the predicate, and finally breaking up the noun phrases, we get the result step by step:

First cut: The teacher / abuses a child.
Second cut: The // teacher / abuses // a child.
Third cut: The // teacher / abuses // a /// child.

In this way we break down the subject-verb-object structure of the sentence.

The bracketing method looks like this:

[3 [1 The teacher] [2 abuses [1 a child]]]

We first bracket the noun phrases “The teacher” and “a child”, then build the higher-level verb-object structure, and finally form the sentence.

So what do these two methods have to do with programming languages? From the example above, we can see that the binary cutting method works from the top down, while the bracketing method works from the bottom up. If you know compiler principles, this should immediately remind you of the LL and LR algorithms used in parsers: LL algorithms process code top-down, recursively expanding grammar rules, while LR algorithms reduce tokens from the bottom up. So when a compiler front end converts a code string into a syntax tree, its parsing algorithm works in a way that has much in common with linguistic methodology.
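To make the top-down idea concrete, here is a minimal recursive descent sketch (the hand-written style usually associated with LL parsing) for the example sentence. The three grammar rules, the tiny lexicon, the S / NP / VP labels and the function names are assumptions made for this illustration; a real parser for English or for a programming language is far more involved.

```javascript
// Hypothetical lexicon: maps each known word to its category.
const LEXICON = { the: 'Det', a: 'Det', teacher: 'N', child: 'N', abuses: 'V' };

function parseSentence(text) {
  const words = text.replace(/\.$/, '').split(/\s+/);
  let pos = 0;

  // Consume one word of the expected category or fail.
  function expect(category) {
    const word = words[pos];
    if (word === undefined || LEXICON[word.toLowerCase()] !== category) {
      throw new SyntaxError(`expected ${category} at word ${pos + 1}`);
    }
    pos += 1;
    return { type: category, word };
  }

  // NP -> Det N
  function parseNP() {
    return { type: 'NP', children: [expect('Det'), expect('N')] };
  }

  // VP -> V NP
  function parseVP() {
    return { type: 'VP', children: [expect('V'), parseNP()] };
  }

  // S -> NP VP: start from the whole sentence and work downward.
  const tree = { type: 'S', children: [parseNP(), parseVP()] };
  if (pos !== words.length) {
    throw new SyntaxError('unexpected trailing words');
  }
  return tree;
}

// Print the tree in bracket notation.
function toBrackets(node) {
  if (node.word !== undefined) {
    return node.word;
  }
  return `[${node.type} ${node.children.map(toBrackets).join(' ')}]`;
}

console.log(toBrackets(parseSentence('The teacher abuses a child.')));
// [S [NP The teacher] [VP abuses [NP a child]]]
```

Each parse function expands one rule before looking at the words beneath it, which is the top-down direction. An LR parser would instead start from the words and reduce them into NP, VP and S.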

Besides structural analysis, a major contribution of syntax to programming languages is the way a language is defined. Any textbook on syntax will mention Chomsky’s 1957 book Syntactic Structures, which introduced the concept of generative grammar, a way to describe a language abstractly with mathematical symbols. A syntactic rule stating that a noun phrase (NP) consists of an adjective (A) and a noun (N) looks something like this:

NP -> A, N

Thus, an NP: damn school node in the syntax tree can be split into the children A: damn and N: school. Such grammars also apply to computer languages. For example, an HTML Tag consists of a TagOpen, a Value, and a TagClose:

Tag -> TagOpen, Value, TagClose

Using this rule, we can split a string such as <p>123</p> in an HTML tree into three child nodes: TagOpen: <p>, Value: 123, and TagClose: </p>. With modern compiler front-end tooling such as parser generators, supplying grammar rules like these is most of what it takes to define the syntax of a new computer language of our own. That is why Chomsky grammars are also covered in detail in compiler-principles textbooks: they are a computer science concept that spans the humanities and the sciences.
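As a rough sketch of the Tag rule, the function below splits a simple tag string into the three child nodes. The sample input <p>123</p>, the regular expression and the node shape are assumptions for illustration; it handles only a single tag without attributes or nesting, which a real HTML parser is not limited to.

```javascript
// Tag -> TagOpen, Value, TagClose (simplified: no attributes, no nesting)
function parseTag(source) {
  // Group 2 captures the tag name so the closing tag must match it.
  const match = /^(<([a-z][a-z0-9]*)>)([^<]*)(<\/\2>)$/i.exec(source);
  if (!match) {
    throw new SyntaxError(`not a simple tag: ${source}`);
  }
  return {
    type: 'Tag',
    children: [
      { type: 'TagOpen', text: match[1] },
      { type: 'Value', text: match[3] },
      { type: 'TagClose', text: match[4] },
    ],
  };
}

// Logs the text of the three child nodes in order.
console.log(parseTag('<p>123</p>').children.map((node) => node.text));
```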

It is worth mentioning that reinventing the parser wheel is fun. For my compiler principles assignment in college, I implemented an LALR parser in JavaScript. The process gives you deep insight into how many pitfalls a weakly typed language can have… If you are interested, give it a try 😀

Semantics and scope

Within linguistics, another area closely related to coding is semantics. Loosely speaking, semantics studies what words mean. For example: when is a telescope gross? This corresponds to another very important concept in programming languages, namely scope.

In semantics, nouns refer to things, verbs to actions, adjectives to attributes, and adverbs to manners, so words have a referential function. Words can be divided into two types, referring and non-referring. For example, school refers to a concrete object, so it is referring, whereas if expresses an abstract relation of condition and hypothesis, so it is non-referring. Referring words can be further divided into constant reference and variable reference. For example, the Great Wall is a constant reference, while he is a variable reference whose referent changes with the occasion. In different contexts, what a word actually refers to will change.

Isn’t the distinction between constant and variable reference very similar to what we see in programming languages? For example, document is a global variable that always points to the DOM document, whereas this points to something different depending on the context of the code. Clear and unambiguous referential relationships in natural language are comparable to rigorous scoping in programming languages.

Natural referential relationships let us write much more fluently. Local variables in code can have concise, readable names as long as they live in a restricted scope (such as a function scope). Without a scoping mechanism, we would have to prevent ambiguity and name clashes with clunky, heavyweight naming conventions (such as the BEM naming convention from earlier front-end development).
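A small sketch of this point: both functions below use the same short local name, and function scope keeps them from clashing. The function names and sample data are hypothetical.

```javascript
function sumPrices(items) {
  let total = 0; // refers to the price total, but only inside sumPrices
  for (const item of items) {
    total += item.price;
  }
  return total;
}

function countWords(text) {
  // the same short name, a different referent, no clash
  const total = text.split(/\s+/).filter(Boolean).length;
  return total;
}

console.log(sumPrices([{ price: 2 }, { price: 3 }])); // 5
console.log(countWords('Children go to school'));     // 4
```

Without scoping, the two variables would need distinct global names such as priceTotal and wordTotal, which is the kind of burden the naming conventions mentioned above were created to manage.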

So we can think of the scope mechanism as an effort by modern programming languages to get closer to natural language. When naming a variable, we can consider, from the perspective of semantics, what the name refers to and what meaning it carries.

Conclusion

By drawing analogies from natural languages to programming languages, we can see quite a few things that are overlooked from a purely technical point of view:

The parts of speech of the words in code are closely tied to how a programming language works.
There are subtle ambiguities between code statements and natural language sentences.
The handling of asynchronous state in a program can be likened to tense.
The way source code is parsed has roots in the syntax studied by linguistics.
Variable naming under a scoping mechanism can draw on semantics in linguistics.
…

As you can see, programming is not just a dry job for science students; it is also closely related to the humanities.

However, some of you might wonder: so how exactly do you write better code? That is not a problem one article can solve. Broadly speaking, it still takes more practice, more thinking, and more learning from better code. If this article has piqued your interest in some of the intersections between programming and the humanities, that is enough.

While writing this article, I arrived at one more thought: if artificial intelligence replaces humans, programming may be one of the last jobs to be replaced. As the discussion above shows, good code requires a clear modular breakdown and fluent expression, both of which have a lot to do with the humanities. From this point of view, programming is not repetitive labor but the distillation of wisdom.

Since the author is only a [user] of computer science and linguistics rather than a [researcher], the content of this article is quite superficial. I hope readers who are more professional in these two fields will point out any mistakes and omissions.

Finally, this column will occasionally publish more articles that combine programming with real-world phenomena. If you are interested, please follow the author on Juejin or GitHub 😀