Text by Yorkie

Today I would like to formally introduce Pipcook, a machine learning application framework for front-end developers, built by the D2C team in Alibaba's Taobao Technology Department. We hope Pipcook can become the platform where front-end engineers learn and practice machine learning, and thereby advance front-end intelligence. The goal of this article is to show you what Pipcook has done and what it plans to do next, and to invite anyone with ideas and code for front-end intelligence to join in, so that front-end intelligence can evolve toward what we imagine.

Why Pipcook?

Imgcook, Ideacook, Reviewcook, and similar projects are collectively referred to as front-end intelligence projects. Ideacook tackles generating product code from product documentation; Imgcook generates business code from visual drafts; Reviewcook helps with intelligent regression verification before code goes live. Pipcook is positioned as the solid, reliable machine learning platform underneath all of these projects, giving the front end the capability to build them. Before you decide to start using Pipcook, ask yourself the following questions:

  1. Do you want to learn about machine learning?
  2. Do you want to train a model yourself?
  3. Do you want to deploy your model yourself?
  4. Do you want to master the ability to constantly refine your model?

If you have any of these ideas, then Pipcook is ready to help front-end developers solve exactly these problems.

Flexible and rich Pipeline

You may be confused about what a Pipeline is, so let's start with an example:

{
  "plugins": {
    "dataCollect": {
      "package": "@pipcook/plugins-mnist-data-collect",
      "params": {
        "trainCount": 8000,
        "testCount": 2000
      }
    },
    "dataAccess": {
      "package": "@pipcook/plugins-pascalvoc-data-access"
    },
    "dataProcess": {
      "package": "@pipcook/plugins-image-data-process",
      "params": {
        "resize": [28, 28]
      }
    },
    "modelDefine": {
      "package": "@pipcook/plugins-tfjs-simplecnn-model-define"
    },
    "modelTrain": {
      "package": "@pipcook/plugins-image-classification-tfjs-model-train",
      "params": {
        "epochs": 15
      }
    },
    "modelEvaluate": {
      "package": "@pipcook/plugins-image-classification-tfjs-model-evaluate"
    }
  }
}

The JSON above is what we call a Pipeline, which is where the "pip" in Pipcook comes from. We want each application to be built from different pipelines, and we break each pipeline into the distinct stages of a machine learning workflow, providing interchangeable plug-ins at every stage. The advantage is that plug-ins can be swapped at any time to get results quickly, while the technical and algorithmic details beneath the plug-ins stay hidden. This reduces the mental burden on Pipeline users, who only need to know what each plug-in does.
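For instance, trying a different network only requires editing the modelDefine entry while every other stage stays untouched. A sketch (the mobilenet plugin name below is illustrative, so check the plugin list for published package names):

```json
{
  "plugins": {
    "modelDefine": {
      "package": "@pipcook/plugins-tfjs-mobilenet-model-define"
    },
    "modelTrain": {
      "package": "@pipcook/plugins-image-classification-tfjs-model-train",
      "params": {
        "epochs": 30
      }
    }
  }
}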

The diagram above shows the flow of a conventional Pipeline in Pipcook.

  • Data Collect: collects the data you want to use for machine learning training, such as images, text, audio, etc.;
  • Data Access: transforms the data collected in the previous stage into a sample set the model can accept;
  • Data Process: applies additional processing to the samples, such as resizing images to a uniform size, converting images to grayscale, vectorizing text, etc.;
  • Model Define: defines the model you use for training;
  • Model Load: in addition to defining models, we also provide a Model Load plug-in for loading already-trained models;
  • Model Train: trains the model defined in Model Define on the samples produced by Data Access/Data Process;
  • Model Evaluate: after training completes, we generally need to evaluate how well the model works, much like a unit test;
  • Model Deploy: if the evaluation results are acceptable, the model can be deployed online; plug-ins at this stage help you complete model deployment.
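The stages above can be sketched in plain Node.js as a chain of functions where each stage's output feeds the next. Every name and data record here is a hypothetical stand-in for a real plug-in, just to show the shape of the flow:

```javascript
// Hypothetical stand-ins for real Pipcook plug-ins, to show the data flow.
function dataCollect() {
  // e.g. downloading images and labels from somewhere
  return [{ raw: 'img-a', label: 'cat' }, { raw: 'img-b', label: 'dog' }];
}

function dataAccess(collected) {
  // turn raw records into a uniform sample set the model can accept
  return collected.map((c) => ({ x: c.raw, y: c.label }));
}

function dataProcess(samples) {
  // extra processing, e.g. resizing; here we just replace each input
  // with a toy numeric feature (its string length)
  return samples.map((s) => ({ ...s, x: s.x.length }));
}

function modelDefine() {
  return {
    classes: [],
    train(samples) {
      this.classes = [...new Set(samples.map((s) => s.y))];
      return this;
    },
    evaluate(samples) {
      // placeholder metric standing in for a real accuracy computation
      return { accuracy: samples.length > 0 ? 1 / this.classes.length : 0 };
    },
  };
}

// Model Train and Model Evaluate, wired together like one Pipeline run:
const samples = dataProcess(dataAccess(dataCollect()));
const model = modelDefine().train(samples);
console.log(model.evaluate(samples)); // { accuracy: 0.5 }
```

In the real framework each stage is an npm package selected in the Pipeline JSON, so swapping one stage never requires touching the others.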


Stable and reliable front-end machine learning ecosystem

Pipcook provides a separate runtime for each plug-in, and in this runtime we have added a dedicated capability for calling Python from JavaScript: Boa.

const boa = require('@pipcook/boa');
const fs = require('fs');
const glob = require('glob').sync;
const acorn = require('acorn');

const { set, len, list } = boa.builtins();
const { DBSCAN } = boa.import('sklearn.cluster');
const { word2vec } = boa.import('gensim.models');

const cwd = process.cwd();
let files = [];
files = files.concat(glob(cwd + '/lib/**/*.js'));

const sentences = [];
const vec2word = {};
const samples = files
  .map((f) => fs.readFileSync(f))
  .map((s) => {
    let ast;
    try { ast = acorn.parse(s); } catch (e) {
      console.error('just ignore the error');
    }
    return ast;
  })
  .filter((ast) => ast !== undefined)
  .reduce((list, ast) => {
    const fns = ast.body.filter((stmt) => stmt.type === 'FunctionDeclaration');
    return list.concat(fns);
  }, []);

samples.forEach((sample) => sentences.push([ sample.id.name ]));

const { wv } = word2vec.Word2Vec(sentences, boa.kwargs({
  workers: 1, size: 2, min_count: 1, window: 3, sg: 0
}));

const X = sentences
  .map((s) => wv.__getitem__(s)[0])
  .map((v, i) => {
    const r = [ v[0] * 100, v[1] * 100 ];
    vec2word[r] = samples[i].id.name;
    return r;
  });

const db = DBSCAN(boa.kwargs({ eps: 0.9 })).fit(X);
const labels = db.labels_;
const n_noise_ = list(labels).count(-1);
const n_clusters_ = len(set(labels));
console.log(n_noise_, n_clusters_, set(labels));

The above code does the following:

  1. Parse the given JavaScript code with the acorn library (JavaScript) to collect all function names;
  2. Convert each function name from step 1 into a vector representation by calling Word2Vec from the gensim.models library (Python);
  3. Cluster the function names by calling DBSCAN from the sklearn.cluster library (Python).

As you can see, with @pipcook/boa we combine the Python and Node.js ecosystems almost seamlessly to complete a machine learning task: classifying (clustering) the function names in a given set of files. More on Boa can be found in the following article:

Here we also want to address whether calling Python through Boa, rather than writing Python directly, leads to poor performance. Our answer is no: the performance is at least nearly the same as that of pure Python code.

The picture above illustrates this point. For both Boa and pure Python, the bottom layer is Python Objects, which contains all Python function definitions, variable definitions, operator definitions, and so on. Python code is translated by its own interpreter into calls to the C API of these Python Objects. Boa works the same way: V8 parses the JavaScript code and maps the operations onto Python Objects. So the difference comes down to how efficiently V8 and the Python interpreter execute their respective code, and since both have their own optimization strategies, the results are roughly the same.

The key point is that although it sounds like you are calling from JavaScript into Python, the actual mechanism is not to convert JavaScript into Python code and hand it to the Python interpreter; instead, Boa works directly with the objects in Python. This removes the overhead of an intermediate layer, so you can use it with confidence.

With the performance question settled, let's talk about what Boa means for front-end intelligence. It means we no longer need nodejieba or other bridging libraries. Boa is a door to the new world of machine learning: the most mature, cutting-edge machine learning ecosystem becomes available at a very low cost.
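To make the mechanism concrete, here is a toy illustration in pure Node.js (no real Python involved, and nothing here is Boa's actual implementation): a Proxy forwards property access and method calls to an underlying object, the way Boa forwards V8 operations to CPython's object API instead of translating source code between the two languages.

```javascript
// Toy stand-in for a Python object; nothing here touches real CPython.
const fakePyList = {
  items: [1, 2, 2, 3],
  count(v) { return this.items.filter((i) => i === v).length; },
};

// The "bridge": property reads are forwarded straight to the underlying
// object, loosely analogous to Boa calling the C-level attribute-lookup
// API rather than generating and re-interpreting Python source code.
function bridge(pyObject) {
  return new Proxy(pyObject, {
    get(target, prop) {
      const value = target[prop];
      return typeof value === 'function' ? value.bind(target) : value;
    },
  });
}

const wrapped = bridge(fakePyList);
console.log(wrapped.count(2)); // the call goes directly to the object: 2
```

Because every operation lands directly on the object, the only interpreter in the loop is the one running your own code, which is why the overhead stays low.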

The future – Application frameworks for front-end developers

Through Pipeline and Boa, Pipcook already contributes a number of building blocks on the machine learning path for developers, but we feel this is still not enough: the core threshold problems remain unsolved. For example, after deploying a model with a Pipeline, then what? After using Boa to implement a machine learning algorithm based on a scikit-learn example, then what? If the model doesn't work well, how can a developer optimize it? Is something wrong with the samples? Why is my model's final evaluation score so low? These are hard problems for a front-end engineer with no machine learning background, and they keep most front-end developers out of machine learning. So for the Pipcook 1.0 release we introduced a concept called MLApp: a machine learning application framework that, much as Vue, React, and Angular help us build view applications, is designed to help us develop machine learning applications. Let's start with a piece of code:

In MLApp, every module that involves machine learning is declared with the ML.Function type via TypeScript's type system and created with the ml.create(fn) function; inside fn we can then use the machine learning APIs. Finally, you just call the created function from any Node.js function. After writing the code, don't try to run it yet: it cannot be executed directly through ts-node or node, because machine learning itself is split into training and prediction. Yet there is no training-related code anywhere in sight. Why? Because MLApp encapsulates training behind Pipcook commands, as follows:
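Since the snippet itself appears as an image above, here is a speculative sketch of that pattern in plain JavaScript. The `ml.create` helper and the classifier logic are illustrative stand-ins, not the real MLApp API:

```javascript
// Illustrative only: a stand-in for the MLApp `ml.create(fn)` pattern.
const ml = {
  create(fn) {
    // In real MLApp, `fn` would use machine learning APIs and training
    // would happen separately via `pipcook train`; at runtime, calling
    // the created function only performs prediction.
    return (...args) => fn(...args);
  },
};

// A "model" module declared once...
const classifyIntent = ml.create((text) =>
  text.includes('button') ? 'ui-code' : 'other'
);

// ...then called from any ordinary Node.js function, e.g. a route handler:
function handleRequest(body) {
  return { intent: classifyIntent(body) };
}

console.log(handleRequest('render a button')); // { intent: 'ui-code' }
```

The point of the pattern is that the call site looks like any other Node.js function call, while training and model management stay outside the application code.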

$ pipcook train example.ts --epoch=15 --sample_path=...
generated the model at example.ts.im
...
$ pipcook try example.ts
$ pipcook deploy example.ts --eas-config=...

As the commands above show, we encapsulate the training and prediction of machine learning so that someone writing a machine learning application only needs to care about what they are doing and how it integrates with other Node.js features. It is just like writing a PDF rendering service in Node.js today: all you need to do is call a PDF rendering library and wire it into your server's routing. The same will be true for future front-end machine learning applications: MLApp expresses the kind of machine learning task and the data types you need, and Pipcook assists you step by step through data collection, data processing, model training, and finally ML application deployment. In the current Pipcook 1.0 plans, we intend to add NLP and Vision capabilities to the MLApp API; if you are interested, please join us in issue#33.

How do I join Pipcook

If you have read this far, you must have developed an interest in machine learning and Pipcook, so we welcome you to join us and contribute to our technical evolution!

In the figure above, we have clearly marked several key nodes:

  • If you’re interested in defining a machine learning application framework, MLApp is a good place to start
  • If you are interested in pipelines and machine learning engineering frameworks, you can start with the Pipcook Client SDK
  • If you are interested in Boa and providing a robust machine learning ecosystem, then our Plugin API is a good place to start

In addition, we have also sorted out some issues for you to quickly participate in:

  • If you would like to participate in the discussion and implementation of our architecture design, you can see it here: github.com/alibaba/pip…
  • If you’d like to contribute to the code, we offer the Good First Issue: github.com/alibaba/pip…

Finally, you are welcome to join our Pipcook Community group to express your thoughts: