ECUG CON 2021, organized by ECUG (Effective Cloud User Group), was held in Shanghai on April 10-11, 2021. At the conference, Xu Shiwei, CEO of Qiniu Cloud, gave a talk titled "Data Science and Go+", sharing his view of how data science is changing, his vision and plans for the new language Go+, and his bold prediction that data science is exploding and that more new companies like Bytedance will emerge. The following is a summary of the speech.

We were just chatting about how ECUG keeps getting more polished, while I increasingly feel like just another speaker. The community is now in its 14th year, and this is also the 14th ECUG CON. It was originally supposed to be held last year but was postponed because of the pandemic.

In fact, I have been following two principles with ECUG:

First, keep myself writing code. Every time ECUG comes around I get nervous; I can't show up with nothing to present. So it is a good way to force myself to stay on the technical front line.

Second, the themes I share each year have a certain continuity; they trace the thread of my own thinking about the future.

Since last year I have been talking about data science. In the three years before that, I talked about practices on the client side. My reasoning was that the first era of cloud computing was machine computing, that is, virtual machines; the second era is cloud native, which I see as an "infrastructure" revolution. In other words, the first phase was about resources and the second about infrastructure. The third phase, in my judgment, is application computing, which involves front-end and back-end collaboration.

Since last year my talks have shifted to data science, and a very important factor behind that is the trend toward the data era. Especially after 2017, as vast amounts of data have been digitized, applications involving data science have started to appear across every industry.

Last year, somewhat on impulse, I also created a new language. I have built quite a few languages before, though their audiences were small, and it was always clear to me that they would never be commercialized. Some companies may have happened to use them commercially, but from the moment they were born they were never meant for commercialization.

I spent a lot of time evangelizing Go around 2012 because, as a startup, it was hard to hire people. A good hiring strategy is to make people find you interesting and believe the company has a strong technical culture. Go+ is the first language I have really wanted to commercialize, but we haven't promoted it much yet, and 1.0 has not been released. Today I want to share my thinking about Go+ and data science, and why I believe there is a commercialization opportunity for Go+.

What I want to talk about today covers four aspects:

  1. The development of programming languages
  2. The development of data science
  3. Go+ design philosophy
  4. Iterations on the Go+ implementation

The development of programming languages

First, let's talk about the development of programming languages, a topic of great interest to programmers. I divide this history into three threads.

First, the history of static languages. I picked from the TOP 20 languages, based on current popularity rankings. Roughly speaking, the list goes like this: the earliest is C, which is still in the top three, followed by C++, Objective-C, Java, C#, Go, Swift, and then Go+. What is interesting is that a new, influential static language emerges roughly every 6-8 years, which reflects iterations in productivity.

Second, the development of scripting languages. Here you will find a very different pattern. First Visual Basic, then Python, PHP, JavaScript, Ruby: scripting languages exploded all at once, all around the time of Java, in the first half of the 1990s. This is a very interesting phenomenon to think about, and there must be underlying reasons behind it.

Third, the development of languages related to data science. Here I had to widen the scope to the TOP 50, because so few of them make the TOP 20. This list is also interesting: the earliest is SQL, followed by SAS, MATLAB, Python, R, and Julia. Python was never designed to be a data science language, yet it eventually became the hottest language in artificial intelligence.

There is another obvious feature: data science languages are as old as static languages, so the field has a long history, but it has not moved nearly as fast. Static languages iterate roughly every 6-8 years; data science languages do not, and the gaps between them are extremely long. But I believe we are now entering a period of acceleration in data science.

You might wonder why I am analyzing the history of languages. A few conclusions are key.

First of all, I think scripting languages are the product of a certain historical period, and static languages are more likely to survive in the long run.

Second, data science is the original need of computing: computers were built to compute. It has a long history, but progress has been slow because the era of the data explosion had not yet arrived.

The development of data science

Having talked about the evolution of languages, let's turn to the evolution of data science, which can also be divided into several stages. The first stage I call the "primitive period", or the "age of mathematical software", and it has two defining characteristics. First, it was confined to limited domains, the most typical being BI (business intelligence). Second, it handled limited data scale: Excel, for example, supports a very limited number of rows and columns, and other software of that era was much the same.

What characterized data science in this period? Above all, it was not infrastructure; it was a mathematical application. But it was very comprehensive and powerful, covering statistics, prediction, insight, planning, decision making, and so on.

The second period, which I call the "infrastructure period of data science", is when data science genuinely became infrastructure. The most typical example is the rise of big data: Google published the MapReduce paper in 2004, followed by Hadoop in 2006 and Spark in 2009. I see this phase, the rise of big data, as the beginning of data science as infrastructure. Unlike the earlier mathematical software, this period prioritized large-scale processing power over rich functionality; its capabilities were relatively limited.

Quite a long time passed between the rise of big data and the rise of deep learning. Deep learning took off with TensorFlow in 2015 and PyTorch in 2017, two of the best-known deep learning frameworks. Deep learning is essentially the automatic derivation of the function F in y = F(x) from data. We normally rely on programmers to implement this F, but the core idea of deep learning is to let the machine generate F automatically so as to achieve the best curve fit. It is computation derived automatically from measured results.

If I did not have Newton's three laws today but did have a pile of measurements, in theory I should be able to rediscover those laws from the data; that is the logic at the heart of deep learning. Deep learning does not replace big data; rather, it strengthens it, taking the capabilities of big data further and making them more powerful.
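To make the "derive F in y = F(x) from measurements" idea concrete, here is a minimal sketch in plain Go (which, per the talk, is also valid Go+). Everything in it is illustrative: the sample data, the linear model, and the gradient-descent loop are my own toy example, not anything from the talk.

```go
package main

import "fmt"

// Fit y ≈ w*x + b to noisy measurements by gradient descent on mean
// squared error: a toy version of "let the machine find F from data".
func main() {
	xs := []float64{0, 1, 2, 3, 4}
	ys := []float64{1.1, 2.9, 5.2, 6.8, 9.1} // roughly y = 2x + 1

	w, b := 0.0, 0.0
	lr := 0.01
	for step := 0; step < 5000; step++ {
		var gw, gb float64
		for i := range xs {
			err := w*xs[i] + b - ys[i]
			gw += 2 * err * xs[i]
			gb += 2 * err
		}
		n := float64(len(xs))
		w -= lr * gw / n
		b -= lr * gb / n
	}
	fmt.Printf("fitted F(x) ≈ %.2f*x + %.2f\n", w, b)
}
```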

There is a view that the technology behind today's economic development is really driven by two core factors: one is computing, and the other is data.

At the core of the data factor is the data science we are discussing today. Data science is in fact a new paradigm; there is a term, the "fourth paradigm" (there is even a company in China named 4Paradigm). We believe data is a higher-order productive force, standing at a higher dimension than computing.

Those were the first two stages of data science; what is the third? I believe it is the explosion of data science, which is today, or in Jack Ma's words, the "DT era". The primitive period meant doing things in limited domains at limited data scale. The future, by contrast, is all-encompassing: first, it is no longer limited to domains such as business intelligence (BI); second, it operates on large-scale data; and third, it is everywhere, in the cloud, on smartphones, in embedded devices, and more, all of which will be infused with what we call data intelligence.

Today, the rise of the mobile Internet has made many companies very successful, and the popularization of the Internet and the birth of Internet applications gave rise to BAT (Baidu, Alibaba, Tencent). But we know that the big newer companies, like Bytedance, are not really successes of the Internet; they are successes of data science. Even so, it cannot yet be said that data science has been democratized; the bar is still very high.

But as we can see, intelligent applications have already emerged. They will not be limited to amplifying productivity in one local area, as Douyin does. Every industry will be affected by data intelligence, which is the fourth paradigm we just mentioned.

Data and data science are sure to be the backbone of the next generation of productivity, and today there are emerging companies like Bytedance and Kuaishou, but they are just the beginning, not the end.

In the primitive period of data science, data was a byproduct. Think of the BI world: data is merely a byproduct, used afterwards for operational decision-making.

But today we are seeing many applications in which data is the raw material. That is a fundamentally different situation; it is why I call this the data science explosion, why I believe Go+ is needed today, and it is the historical backdrop behind Go+.

The future of data science must be a fusion of general-purpose languages and mathematical software; only then will the foundations of data science truly be complete. But in today's world the infrastructure of data science is far from finished; that is my own judgment.

Python is pretty good today, so why Go+?

Of course, many people wonder: Python is pretty good today and is so widely used in deep learning, so why isn't Python enough, and why do we need Go+? In fact, I don't think Python can serve as infrastructure. It is a scripting language, and in my view it answers the needs of only a particular stage in history.

Data science is itself a revolution in computing power. Even at the chip level, the shift toward data-centric computation is the core reason Nvidia overtook Intel. This is even more true in the software layers above, where a new infrastructure carrier is bound to emerge.

This kind of computing power is inherently compute-intensive work, and Python relies on C underneath, so Python alone will not do. Today C plus Python underpins deep learning as a whole, but data science must sink further down into the infrastructure, and what will that lead to?

That is why we need Go+ today. I will focus on why I think Go+ has an opportunity to be commercialized. Of course, commercialization here does not necessarily mean making money; let's not misunderstand that. In most people's eyes a language may not be profitable, but that does not mean it is unimportant.

Go+ design philosophy

Having covered the development of data science, let's talk about the design philosophy of Go+. Why is Go+ the way it is? Behind computing stands the programmer; behind data science stands the data scientist, or analyst. The two roles are different, though both are technical jobs. Programmers are relatively easy to train, and there are plenty of them today, but data scientists are relatively scarce. That is why, when deep learning took off a few years ago, so-called AI engineers were in such hot demand and commanded far higher salaries than programmers: good data scientists simply aren't easy to find.

This role bridges technology and business, and people who have both are hard to find. Data science is first and foremost a technical job, but it requires business understanding as well as technical competence. Even today there is no systematic way to train data scientists, no established methodology.

So what is the core idea behind Go+?

First, we want Go+ to unify programmers and data scientists, to give them a shared vocabulary so they can talk to each other naturally. I consider this one of the core ideas of Go+: an essential part of its logic is letting these two roles communicate in a single language.

From this starting point we derived several design decisions. First, Go+ is a static language whose syntax is fully compatible with Go. Second, it is more script-like in form and has a lower learning barrier than Go: Go's barrier may be low for a static language, but it is not low enough, not as low as Python's. Third, since we are building a language for data science, it must support a more concise, mathematical syntax. Fourth, it has a dual engine, supporting both static compilation to executables and compilation to bytecode for interpreted execution.

Why did we choose full syntactic compatibility with Go? First, I personally believe static languages are far more resilient and can survive historical cycles. It is easy to see that a language needs to cross cycles, because the life cycle of a language is usually very long. We cannot design a language around whatever is popular right now; we have to find the elements that endure across cycles.

Second, why Go? In my opinion, Go has the simplest syntax and the lowest learning threshold of any static language, even for people who have never used one before. Our company was among the first to hire Go programmers, yet most of the people we hired did not know Go when they joined. When we adopted Go, very few people in the world believed it was the language of the future. Our own experience is that about two weeks of study is basically enough, which is an extremely low barrier for a static language.

But for a data science language, that bar is still not low enough. Although Go+ is fully compatible with Go, we want its barrier to be even lower, so it is more script-like in form than Go, because scripts tend to be easier to understand. We want Go+'s learning threshold to be on the same level as Python's.
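As a rough illustration of what "more script-like in form" means (following the style of the public Go+ tutorials of that period; treat the exact syntax as a sketch rather than a spec), a complete Go+ program can be a single file of top-level statements, with none of the package/import/main boilerplate the equivalent Go program needs:

```go
// hello.gop — a complete Go+ program; no package clause, import, or main function
println("Hello, Go+")

// the equivalent Go program would need:
//   package main
//   import "fmt"
//   func main() { fmt.Println("Hello, Go+") }
```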

Go+ was born in May or June of last year, and around October I started trying it out with three children aged 13 to 14, from the last year of primary school to the first year of junior high. This experiment showed that the idea is feasible: they understood the Go+ design and were comfortable writing code in Go+. It also showed that all the simplifications we made on top of Go pay off.

I have briefly listed some Go+ syntax here, not all of it, just some of the expressions I find relatively concise. Python has no built-in rational number literal, and we believe rational numbers will become very common in data science, especially for lossless numerical computation, so Go+ supports rational numbers natively. Map and slice literals, of course, have rough equivalents in Python.
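Here is a sketch of the literals mentioned above, written as a Go+ script in the style of the Go+ tutorials of the time (details may have changed in later releases):

```go
a := 1r << 65                  // bigint: arbitrary-precision integer
b := 4/5r                      // bigrat: the exact rational number 4/5
m := {"Hello": 1, "xsw": 3.4}  // map literal, type inferred as map[string]float64
s := [1, 2.3, 3.6]             // slice literal, type inferred as []float64
println(a, b, m, s)
```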

Python also has list comprehensions, but our support for them is very thorough: if you understand how to write a for loop in Go+, you essentially understand list comprehensions. They offer a concise way to express common data science operations. That is a quick overview of the syntax; if you have never looked at Go+ before, I hope it gives you a general sense of the language.
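And a sketch of the list comprehension form described here, again following the tutorials of the time rather than any final specification:

```go
y := [x * x for x <- [1, 3, 5, 7, 11]]          // [1 9 25 49 121]
z := [x * x for x <- [1, 3, 5, 7, 11], x > 3]   // with a filter: [25 49 121]
println(y, z)
```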

What is interesting about Go+ is that it is the only language with a dual engine, supporting both static compilation and interpreted execution.

Why two engines? Because programmers and data scientists have different needs. Data scientists like to work step by step; think of the mathematical software you have seen, such as SAS and MATLAB, whose interaction model is single-step execution.

This is not because data scientists are lazy. A programmer holds the logic in their head and can judge there whether it is written correctly. A data scientist doing a calculation, however, cannot know in advance whether the result will be right, because humans are far weaker at computation than machines, so they have to look at each intermediate result to decide what to do next. In this respect the working mode of a data scientist is completely different from that of a programmer.

Because they are doing calculation rather than procedural logic, it is hard to avoid single-step execution.

But once a data scientist has finished building a model and it is ready for use, they want the final delivered implementation to run as efficiently as possible; they certainly don't want slow code. At that point they want static compilation. That is why Go+ is designed with a dual engine: one working mode for the debugging phase and another for the production phase.

Iterations on the Go+ implementation

Having covered Go+'s design philosophy, let's move to the final part: iteration on the Go+ implementation. What has Go+ accomplished so far? Go+ has not yet released version 1.0, but roughly 60 to 70 percent of the planned syntax is already supported, so syntax coverage is in reasonably good shape.

Go+ source code goes through a scanner to produce Go+ tokens, and then through a parser to produce a Go+ abstract syntax tree, as most languages do. From the abstract syntax tree there are two branches: one generates Go code so the program can be statically compiled, and the other generates bytecode for interpreted execution. The polymorphism between the two branches is achieved by introducing what we call an execution specification (exec.spec), which is essentially an abstract interface.

During this iteration I personally noticed a problem: it takes newcomers to the Go+ team quite a while to become familiar with the whole codebase. The execution specification in Go+ is effectively an abstract SAX-style interface; it is event-driven, meaning I send events to a receiver and the receiver handles them however it likes, a pattern that is common in text processing.

The interface we designed is essentially an event-driven pattern that connects the components: the compiler walks the abstract syntax tree and emits events, and the two code generation modules receive those events and do what they need to do. Code in this style is somewhat hard to understand, especially when the compiler itself does complicated things. If you know Go internals, you know that type inference in Go is complicated, and in fact most of the complexity in our compiler comes from type inference.
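To make the event-driven idea concrete, here is a purely hypothetical sketch in Go; the names and signatures are illustrative, not the actual exec.spec API. The compiler emits events through one interface, and each backend, Go code generation or bytecode, provides its own implementation.

```go
package exec

// Builder is a hypothetical event-style interface in the spirit of the
// execution specification described above. The compiler walks the AST and
// calls these methods as "events"; it never knows which backend it drives.
type Builder interface {
	Push(val interface{}) Builder      // push a constant onto the stack
	LoadVar(name string) Builder       // load a variable
	BinaryOp(op string) Builder        // apply a binary operator
	Call(fn string, arity int) Builder // call a function with arity arguments
	EndStmt() Builder                  // statement boundary
}

// One Builder implementation emits Go source for static compilation;
// another emits bytecode for the interpreter.
```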

I am currently refactoring this logic so that the execution specification is no longer an abstract interface but a standard DOM-style implementation, one that itself includes type inference, which should make the compiler considerably simpler. There is more implementation detail here than I can cover today; we can discuss it another time.

Now let me talk about Go+'s priorities going forward.

First and most important, we plan to release version 1.0 this year, and the key goal for 1.0 is to pin down the user-facing paradigm as much as possible. After 1.0, I hope Go+ will be like Go in that subsequent syntax changes are few. The most important work right now is to figure out which core syntax Go+ needs and support it in 1.0, except where there are specific reasons not to, such as particularly complex features like generics, which, as with Go itself, will come in later versions. We may give up a few of the most complex syntax features, but we are doing everything we can to nail down most of the syntax we need in 1.0.

For Go+ 1.0 we will iterate on a single engine first: the statically compiled engine. The script engine will be iterated on after 1.0 is released. This decision also follows from the principle described above of settling the user-facing paradigm first.

Finally, we want to run Go+ as a commercial operation, and we will be hiring members for the Go+ team. You are welcome to join us!

I believe the core of Go+ is that it unifies the language of programmers and data scientists, letting them talk to each other naturally. Beyond that, I strongly believe Go+ represents the next revolution in data science. I am very excited to be doing this myself, and anyone who shares this vision is welcome to join us.

Here is how to reach us: first, the project address (https://github.com/goplus); second, the email for résumés ([email protected]); third, my Twitter handle (@xushiwei).


Xu Shiwei is the founder and CEO of Qiniu Cloud, chief evangelist for Go in Greater China, creator of Go+, and founder of the ECUG community. He previously worked at Kingsoft and Shanda and has more than ten years of R&D experience in search and distributed storage technologies. At Kingsoft, as Chief Architect he led the architecture design and development of WPS Office 2005. After founding Kingsoft Laboratory, he led the development of distributed storage as technical director. He later joined the Shanda Innovation Institute, where he successfully launched "Shanda Network Drive" and "Shanda Cloud". In 2020 he was named one of the "open source figures in the hearts of 33 Chinese open source pioneers".