0 / introduction

Welcome to the Notebook (Jupyter) computing platform. The platform is based on JupyterLab and provides many customized functions. These instructions walk through the full workflow: applying for an instance, getting familiar with the development environment, the novice guide and demo code, how to install extension packages, and how to query data, process files, process data, conduct data analysis, data visualization, modeling, report making and sharing, and carry out collaborative development (project management, version management, and GitHub are supported). Notebook demos are provided throughout the instructions for viewing, cloning, and secondary development based on the provided templates.

1/ Platform introduction

<1> Why do we need a Notebook

A typical scenario for a data-driven company is a recommendation engine for external users, which performs recall and ranking based on user profiles, content hotspots, and so on; all of this relies heavily on the data, strategies, and models produced by data engineers, data analysts, and data scientists. Notebook is one of several tools that the Data Platform group and others have developed to support this development work. The Notebook itself offers great flexibility and breadth: it can be used for data development, data query, data processing, data analysis, data mining, and other work, and managers can also use it to view reports and analyses and make corresponding decisions. As a simple side-by-side comparison with other data products and development platforms, Notebook offers a web-based, one-stop, collaborative (Git), interactive (input and output together) computing toolkit.

<2> What is a Notebook

Now that we have described Notebook as a web-based, one-stop, collaborative (Git, project management, code version management), interactive (input and output together) computing toolkit, let's look at the platform's architecture and capabilities.

The first layer, the physical layer, mainly consists of:
1. Docker-based instance allocation (every developer applies for his or her own instance for development).
2. Ceph shared disk (personal files, including extension packages, are stored on one's own ceph).
3. Hadoop big data infrastructure (including storage and computing).

The second layer, the computing layer, mainly includes:
1. Local computing in Jupyter.
2. Access to Hiveserver to query hive tables, with intelligent engine acceleration (Smart, MR, Spark).
3. For large amounts of data, distributed Spark computing (PySpark, generally used through the myutil integration library).

The third layer, the service layer, includes:
1. Standard Jupyter components: JupyterHub, Notebook Server, and the various kernels.
2. Customized functions, such as the admin module, which provides authentication, development group, and instance management, and is used to apply for and map resources at the computing and physical layers.
3. Collaborative development functions, such as sharing and cloning, project management and task collaboration, and version management and diff, developed as Jupyter plug-ins.

The fourth layer, the application layer, corresponds to the visible front-end interfaces:
1. The management side, where you can manage instances and view public and shared notebook files.
2. The development side, which supports Lab and Tree modes and is the main entrance for data development, query, processing, analysis, mining, and decision-making.

Based on this architecture, the Notebook provides the following functions and features for users' data work:
1. Data development: deep integration with Hadoop; data processing can be done with HiveSQL.
2. Data query: multiple query methods are supported; for Python, the PyHive extension package, the myutil custom library, and Beeline in the terminal can be used, and Hiveserver supports engine acceleration.
3. Data processing: numpy, pandas, and other data processing extensions are installed by default.
4. Data analysis and visualization: in addition to numpy and pandas, statistical and visualization extensions such as scipy and matplotlib are supported.
5. Data mining: sklearn, statsmodels, and similar packages; distributed PySpark computing is supported for large-scale data.
6. Data decision-making: Notebook supports content forms such as code, Markdown, and raw, and report conclusions, analysis scripts, and analysis results can be shared.

Other functions, aimed at ks-specific needs, mainly cover collaborative development and stable computing: we provide sharing and cloning, project management and task collaboration, version management and diff, resource monitoring, and other functions.

<3> How to use the ks Notebook

To use the Notebook tool, you should have some Python background. We also provide demo code to help you get started. The usage process and operation screenshots are as follows:

1. Apply for an instance: visit the Notebook home page and apply for an instance. When applying, you need to specify the development group and select the corresponding resource package. Currently there are 10G and 20G packages; users who need big data computing can be whitelisted for a 50G package. All notebook files, configuration files, and personally installed packages are stored on the ceph shared storage, so recycling an instance does not affect their use.
2. Development environment: click the instance entry to start the development environment, which currently supports Lab and Tree development modes. Lab mode supports multiple windows (several windows at the same time, similar to Linux), more flexible plug-ins and interactions, and is the mode most Notebook users are comfortable with. The development environment supports multiple kernels such as python2, python3, R, and PySpark.
3. Novice guide: in the /home/demo directory we provide some basic notebook files that you can refer to and copy. In share mode, we also provide demo sharing links that you can clone directly into your personal directory. In the future we will publish some notebook files on the Notebook portal, where you can also quickly locate and share your own notebooks.
4. Extension packages: as mentioned earlier, to ensure that instance recycling does not affect use, third-party libraries are installed on the ceph shared disk under the personal directory, not in the local docker directory. Common third-party libraries are already installed by default and can be viewed with pip list; if the third-party library you need is missing, you can quickly install it yourself with pip install.
5. Development: subsequent data development, processing, analysis, and mining can be coded in whichever development mode you prefer (I prefer Lab mode over Tree mode). At the end of the teaching video we also provide three teaching cases: data development, data analysis, and data mining.

2/ Platform use

From instance application, to getting familiar with the development environment, the novice guide and demos, how to install extension packages (third-party libraries), how to query data, process files, process data, conduct data analysis, visualization, modeling, report making and sharing, and how to carry out collaborative development (project management and version management are supported): Notebook demos are provided in these instructions for viewing and cloning (secondary development can be based on the provided templates).

<1> Applying for an instance

Specify the name of the instance, its resource package, and the development group to which it belongs.

<2> Getting familiar with the development environment

Through the instance portal you enter your own instance (that is, your own development environment), which opens in Lab mode by default. There are two development modes, JupyterLab mode and JupyterTree mode. Lab mode is as follows:

Tree mode is as follows:

I personally prefer the JupyterLab development mode, because Lab mode is more open and supports more features.

<3> Novice guide and demo examples

Both notebook and terminal development modes are supported. To get started: select a directory (a personal directory or a project directory; make sure you have write permission) and click the + in the upper left corner to open the Launcher panel. In the Launcher panel, select the development mode you want to use, as shown in the picture below:

If notebook development is selected, five kernels are supported: python2, python3, R, PySpark, and Spark. I use python3; python3.7 has some third-party libraries (extension packages) installed by default, and you can use the pip3 list command to check which extension packages (third-party libraries) are already installed. If the third-party library you need is not installed, you can install it yourself with pip3 install xxx. The installation path of the default third-party libraries is different from the path of libraries you install yourself. For example, pip3 show pandas shows that it is under the root directory, meaning it is installed by default.

For example, pip3 show loguru shows that it is under the personal .local directory, indicating a self-installed package that lives on your own ceph, so it is unaffected even if the instance is recycled.

In addition, during development, if you are unsure how to do something, you can go to the /demo directory and look at the corresponding demo code, for example basic_operation.ipynb, hellonotebook.ipynb, and other demos. The demos cover:
1. Running shell commands in the notebook, such as ls, pip install, pip list, etc.
2. IPython magic commands, which can be used to check running time, set up visualization, and so on; note the use of % or %% (see the sketch after this list).
3. Multi-view: JupyterLab mode supports the sidebar and flexible layout of the workspace. You can have multiple notebooks open at the same time or let several notebooks share one visible area, and you can change the page layout by dragging a notebook's title.
4. File processing: the file tools provided by Jupyter can upload small files and read them; for specific examples please refer to the demo code.
5. For other operations, please refer to the demo code directly.
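A minimal sketch of points 1 and 2, showing shell commands and standard IPython magics (%time, %%time, %matplotlib inline) in notebook cells; the paths used are hypothetical:

    # Shell commands in a notebook cell are prefixed with "!"
    !ls /home/demo
    !pwd

    # Line magics start with "%", e.g. timing a single statement
    %time sum(range(1000000))

    # Cell magics start with "%%" and must be the first line of their own cell,
    # e.g. timing everything in the cell
    %%time
    total = 0
    for i in range(1000000):
        total += i

    # Visualization setup is also done with a line magic
    %matplotlib inline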

<4> Introduction to each kernel

JupyterLab integrates and supports a variety of kernels. The way I understand it, the kernel is the interpreter, like the interpreter in PyCharm: different code needs a corresponding language runtime to interpret and execute it.

The Notebook kernels:
- python3.7, including PySpark; see [pyspark use help](https://wiki.corp.kuaishou.com/pages/viewpage.action?pageId=126320287)
- python2.7, including PySpark; see [pyspark use help](https://wiki.corp.kuaishou.com/pages/viewpage.action?pageId=126320287)
- R; see [R kernel use help](https://wiki.corp.kuaishou.com/pages/viewpage.action?pageId=846068742)
- Spark (Scala), not supported in the East and Singapore clusters; see [spark (scala) kernel use help](https://wiki.corp.kuaishou.com/pages/viewpage.action?pageId=846068700)

The Console:
- python3.7
- python2.7
- R
- Spark (Scala)

User-defined kernels: Jupyter supports kernels installed by the user. A self-installed kernel is bound to the user and does not drift with the instance; see the help document [create a custom kernel through conda](http://zt.corp.kuaishou.com/user/document/detail?serviceId=103&portalType=1&docId=3015)

<5> Installing Python / R / PySpark third-party libraries (extension packages)

The first step for Notebook developers using Jupyter is to check the list of installed packages and install their own dependencies, because third-party libraries are crucial for development. Note: do not run pip commands inside the kernel (notebook), e.g. !pip install xxx or !pip show xxx. The pip invoked there belongs to the Python that Jupyter itself is installed with, not the Python running in the kernel, so the command does not take effect. Python manages extension packages (third-party libraries) with the pip tool, and the python2/3 environments have pip installed by default. If the kernel you choose is python2, the corresponding tool is pip; if the kernel you choose is python3, the corresponding tool is pip3. You can check the pip version with the --version option:
- pip --version shows the pip version for the default python2.7 installation
- pip3 --version shows the pip version for the default python3.7 installation

Before installing a package with pip, first check which Python versions the extension package supports; not all packages support both python2 and python3. Common commands (pip for the python2.7 kernel, pip3 for the python3.7 kernel):
- pip3 install xxx: install the xxx extension package for the python3.7 kernel (notebook); use pip install xxx for the python2.7 kernel.
- pip3 show xxx: display information about the xxx extension package in the python3.7 kernel (notebook), such as the installed version.
- pip3 install xxx==version: install a specific version of an extension package.
- pip3 install --upgrade xxx: upgrade an extension package that is already installed. If you do not specify a version number, it upgrades to the latest version, e.g. pip3 install --upgrade Django or pip3 install --upgrade Django==3.0.4.
- pip3 uninstall Django: uninstall an extension package.

pip third-party sources (PyPI): pip downloads extension packages from the company source by default. If the download is too slow, or the company source does not have the package, you can temporarily specify another download source, for example: pip3 install xxxxxx -i https://pypi.douban.com/simple

Commonly used sources:
- Company internal source: https://pypi.corp.kuaishou.com/kuaishou/prod/+simple/
- Douban: https://pypi.douban.com/simple/
- Aliyun: https://mirrors.aliyun.com/pypi/simple/
- USTC: https://pypi.mirrors.ustc.edu.cn/simple/
- Tsinghua: https://pypi.tuna.tsinghua.edu.cn/simple/

pip environment check: before using pip, please check whether the pip environment is configured correctly. To check, open a terminal and run which pip2 and which pip3, for example:
    [root@bjzyx-c448 cron]# which pip2
    alias pip2='/root/miniconda2/bin/pip'
    /root/miniconda2/bin/pip
    [root@bjzyx-c448 cron]# which pip3
    alias pip3='/root/miniconda3/bin/pip'
    /root/miniconda3/bin/pip
If the output does not match, complete the following steps: cp /home/public/tools/bashrc ~/ and then source ~/.bashrc. Reminder: do not manually upgrade pip; if you have already upgraded it, we suggest uninstalling the upgrade. To uninstall pip2: python2.7 -m pip uninstall pip; to uninstall pip3: python3.7 -m pip uninstall pip.

conda: conda is an open-source package management and environment management system for installing multiple versions of software packages and their dependencies and switching between them easily. Conda was created for Python programs on Linux, OS X, and Windows, and can also package and distribute other software. Conda installs pre-compiled packages directly, avoiding many of the environment-dependency problems that pip's source compilation and installation can run into, and is designed to complement pip. Common conda commands:
- Create an environment: conda create --name py37 python=3.7
- Activate an environment: conda activate py37
- Exit the environment and return to the main environment: conda deactivate
- Delete an environment: conda remove --name py37 --all
- View all environments in the system: conda info -e
- Install a package: conda install numpy
- View installed packages: conda list
- View installed packages in a specific environment: conda list -n py37
- Search for package information: conda search numpy
- Install a package into a specified environment: conda install -n py37 numpy (if the environment is not specified with -n, the package is installed into the currently active environment)
- Update a package: conda update -n py37 numpy
- Delete a package: conda remove -n py37 numpy
- Update conda itself: conda update conda
- Update anaconda: conda update anaconda
- Update python: conda update python (assuming the current environment is python 3.5, conda will upgrade python to the latest version in the 3.5.x series)

<6> Code blocks

The platform provides reusable code blocks for common code. With the code block function you can quickly reuse code, improve coding efficiency, and promote good coding practice across the company and team. There are public code blocks, code blocks shared within your development group, and your own code blocks. You can only edit, modify, or delete your own code blocks; public and development-group blocks can only be viewed, not edited, unless you have permission. Code blocks come in two kinds: 1. Public code blocks: published by the platform, extracted from common coding templates such as data query, data processing, data visualization, data modeling, and applications. 2. Custom code blocks: blocks defined by the user (for example initialization code, common custom functions, frequently used SQL, etc.), with a configurable visibility scope (private, team-shared, public). In other words, to avoid reinventing the wheel, the platform provides some code blocks, and individuals can also contribute good code blocks for everyone to use.

<7> Data query and examples

The Notebook provides the following methods for querying Hive tables: 1. use the encapsulated myutil library to query data; 2. use the PyHive extension package; 3. use Beeline in a terminal.

1) myutil query: this is the most widely used method (a minimal sketch is shown after this subsection). The main methods are:
1. query_hive: data query; supports caching, log printing, a maximum number of returned rows, the intelligent engine, an interrupt flag, and returns a list.
2. execute_hive: run an ordinary query, or a query that sets initialization, imports external UDFs, etc.
3. define_table: create a table, with an overwrite switch.
4. lazy_query: data query that returns a dataframe; supports pre-execution, returned-column settings, ordering, a maximum number of returned rows, the intelligent engine, an interrupt flag, whether to use the cache, etc.
5. save_df_to_hive: write a dataframe to a hive table.
6. get_sparksession: get a PySpark session.
7. query_druid: query Druid.
8. query_clickhouse: query ClickHouse.
9. connect_hdfs: connect to HDFS.
10. clear(): clear the SQL cache.
11. connect(): reconnect if the connection fails.

2) PyHive query: PyHive is rarely used to query data. To query data using the PyHive extension package, see /home/demo/data_query.ipynb (which uses import pyhive).
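A minimal sketch of a myutil query, assuming the query_hive signature seen in the hive-write example later in this document (a SQL string plus a cache flag, returning a list of rows); the table name is hypothetical:

    import myutil

    # query a hive table; query_hive returns a list of rows
    sql_command = "select product, count(*) as pvs from ks_hdp_dev.some_demo_table group by product limit 100"
    rows = myutil.query_hive(sql_command, cache=True)
    print(rows[:5])

    # lazy_query returns a pandas dataframe instead of a list (per the method overview above);
    # its exact parameters are an assumption based on that description
    # df = myutil.lazy_query(sql_command)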

3) Beeline query: to query data in a terminal, use Beeline. For details, see the user guide provided by the Hiveserver group; Beeline usage document: https://wiki.corp.kuaishou.com/pages/viewpage.action?pageId=384732325

4) PySpark query: PySpark supports distributed data query. For an example, see /home/demo/pyspark_demo.ipynb

The steps: start a Spark session, then spark.sql() reads data directly and returns a dataframe.
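A minimal PySpark query sketch, assuming the Spark session is obtained through myutil's get_sparksession (listed in the method overview above; the exact signature is an assumption) and using a hypothetical table name:

    import myutil

    # start a spark session
    spark = myutil.get_sparksession()

    # spark.sql() reads data directly and returns a spark dataframe
    sdf = spark.sql("select product, count(*) as pvs from ks_hdp_dev.some_demo_table group by product")
    sdf.show(10)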

<8> File upload and download

1) Uploading local files to your ceph: you can upload local files by clicking the up arrow on the left, and download them directly from ceph as well.
2) HDFS file operations: kernels such as PySpark may need to upload third-party class libraries during computation (currently the PySpark kernel is no longer used; the python3 kernel is used directly, with PySpark from the myutil integration library). You can then upload files to HDFS (using the locally installed streamlit extension package as an example):
Step 1: locate and package the local files
    pip3 show streamlit    # find where streamlit is installed
    cd /home/hezan/local/lib/python3.7/site-packages/streamlit
    zip -r /tmp/streamlit.zip .
Step 2: upload to HDFS
    /home/hadoop/software/hadoop/bin/hadoop fs -put /tmp/streamlit.zip viewfs://hadoop-lt-cluster/home/hdp/lzw_debug    # make sure you have write access
    /home/hadoop/software/hadoop/bin/hadoop fs -ls viewfs://hadoop-lt-cluster/home/hdp/lzw_debug    # view the uploaded files
3) Use myutil.connect_hdfs() for file processing:
    # Operate HDFS from the python kernel
    # Input parameter: none
    # Return: a HadoopFileSystem object
    # Note: this function is bound to the developer account. Please select the correct development group before using it

    # Reference code

    from myutil import connect_hdfs

    #the hs host is:nginx-airflow-hiveserver2.internal:10001
    #the group_id is:98
    #the hs host is:bjlt-rs238.sy:20000
    #the group_id is:98
    # myutil 1.1.6

    fs = connect_hdfs()
    # The HDFS user corresponding to the current user group is dp. If this is not the target user group, select the correct user group in the menu bar

    # View information
    print(fs.df())
    print(fs.get_capacity())
    print(fs.get_space_used())
    print(fs.info("/user/dp/jupyter_snapshot/"))
    print(fs.disk_usage("/user/dp/jupyter_snapshot/"))

    # Create folder
    fs.mkdir("/user/dp/jupyter_snapshot/test/")
    print(fs.isdir("/user/dp/jupyter_snapshot/test/"))
    print(fs.ls("/user/dp/jupyter_snapshot/test/"))

    # upload file
    with open("/home/gaozhenmin/test.txt"."rb") as f:
        if fs.isfile("/user/dp/jupyter_snapshot/test/test1.txt"):
            fs.rm("/user/dp/jupyter_snapshot/test/test1.txt")
        fs.upload("/user/dp/jupyter_snapshot/test/test1.txt", f)
    print(fs.info("/user/dp/jupyter_snapshot/test/test1.txt"))
    #fs.chmod("/user/dp/jupyter_snapshot/test/test1.txt",777)
    print(fs.info("/user/dp/jupyter_snapshot/test/test1.txt"))
    #fs.chown("/user/dp/jupyter_snapshot/test/test1.txt","gaozhenmin","dp")
    print(fs.info("/user/dp/jupyter_snapshot/test/test1.txt"))
    print(fs.stat("/user/dp/jupyter_snapshot/test/test1.txt"))

    # delete file
    fs.delete("/user/dp/jupyter_snapshot/test/test1.txt")
    # write files
    with fs.open("/user/dp/jupyter_snapshot/test/test2.txt".'wb') as f:
        f.write("test")

    # read file
    with fs.open("/user/dp/jupyter_snapshot/test/test2.txt", 'rb') as f:
        content = f.read()
        print(content)
    print(fs.exists("/user/dp/jupyter_snapshot/test/test2.txt"))
    # rename
    fs.rename("/user/dp/jupyter_snapshot/test/test2.txt"."/user/dp/jupyter_snapshot/test/test3.txt")

    # Download file
    with open("/home/gaozhenmin/test3.txt"."wb") as f:
        fs.download("/user/dp/jupyter_snapshot/test/test3.txt", f)
    print(fs.ls("/user/dp/jupyter_snapshot/test/"))
    fs.delete("/user/dp/jupyter_snapshot/test/test3.txt")
    print(fs.isfile("/user/dp/jupyter_snapshot/test/test3.txt"))

    # delete folder
    print(fs.ls("/user/dp/jupyter_snapshot/test/"))
    fs.delete("/user/dp/jupyter_snapshot/test/")
    print(fs.isdir("/user/dp/jupyter_snapshot/test/"))

<9> Data processing and examples

1) Data reading: pd.read_csv(), pd.read_excel(), PyHive, myutil, or obtaining data through an API interface.
2) Viewing data, with attributes and methods such as: shape (the data outline: how many rows and columns), columns (the column index), T (transpose), values (the underlying values), and describe() for quick basic statistics. Data selection: df[column][index] or df.loc[indexes, columns]; conditional selection: df.loc[df.column == value] or df.query(condition).
3) Data processing:
1. Adding and deleting data: add data by adding a column index and assigning to it directly; delete or rearrange data as needed:
    df[column] = value           # add a column
    pd.concat([df, df1])         # concatenate / add rows
    df.drop([index])             # delete rows
    df.drop([column], axis=1)    # delete a column
    df.drop_duplicates()         # remove duplicate rows
2. Row-wise transformation:
    for index, row in df.iterrows():                                          # for loop
        df.loc[index, 'client_date'] = row['client_time'].split(' ')[0]
    df['client_date2'] = df['client_time'].apply(lambda x: x.split(' ')[0])   # apply + lambda
3. Grouping and statistics:
    df2 = df.groupby(df['product'])                    # returns a DataFrameGroupBy object
    df.describe()
    df['user_name'].groupby(df['product']).count()     # count by group
4. Type conversion with to_datetime():
    df['client_time'] = pd.to_datetime(df['client_time'])
4) Write data to a hive table (a consolidated pandas sketch of the operations above follows the hive-write example below):
    # myutil's save_df_to_hive() writes a DataFrame data object to the specified hive table.
    # The save_df_to_hive() method takes the following parameters:
    #   1. the dataframe data object
    #   2. proxy_user: the development group
    #   3. name: the name of the target table
    #   4. save_format: the save format, parquet by default
    #   5. partition_cols: the partition column names
    #   6. mode: one of append, overwrite, ignore, error

    # Example code:
    import myutil
    import pandas
    import numpy

    # Create a dataframe data object
    s = [[2012, 12, 0], [2012, 13, 1], [2020, 14, 0], [2020, 15, 1]]
    data_df = pandas.DataFrame(numpy.array(s), columns=['year', 'count', 'state'])
    
    # Write the dataframe to the hive table
    myutil.save_df_to_hive(data_df,
                           proxy_user="hdp",
                           name="ks_hdp_dev.jupyter_gao1")

    # Test whether the write was successful
    sql_command = u"select * from ks_hdp_dev.jupyter_gao1"
    rows = myutil.query_hive(sql_command, cache=False)  # returns a list of rows: [[], [], [], [] ...]
    print(rows)

    #2020-04-25 16:58:49 start writting hive table
    #2020-04-25 16:58:51 finish writting hive table
    #done, fetched rows:4
    #saving to cache file: /home/public/.hiveutil/cache/17/17ce92fc811b6c77f4c80258d67cf3f75880f925
    #saving to cache file: /home/public/.hiveutil/cache/17/17ce92fc811b6c77f4c80258d67cf3f75880f925_md5
    # [(2012, 12, 0), (2012, 13, 1), (2020, 14, 0), (2020, 15, 1)]
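Below is the consolidated pandas sketch referenced in the data-processing list above, a self-contained illustration of viewing, selecting, adding/deleting, apply, groupby, and to_datetime; the column names mirror the snippets above and the data is made up:

    import pandas as pd

    # made-up sample data using the column names from the snippets above
    df = pd.DataFrame({
        'product': ['a', 'a', 'b', 'b'],
        'user_name': ['u1', 'u2', 'u3', 'u4'],
        'pvs': [10, 20, 30, 40],
        'client_time': ['2020-04-01 10:00:00', '2020-04-01 11:00:00',
                        '2020-04-02 12:00:00', '2020-04-02 13:00:00'],
    })

    # view data
    print(df.shape, list(df.columns))
    print(df.describe())

    # select data
    print(df.loc[df['product'] == 'a'])
    print(df.query("pvs > 15"))

    # add / delete
    df['site'] = 'main'                 # add a column
    df = df.drop(['site'], axis=1)      # delete the column again
    df = df.drop_duplicates()           # remove duplicate rows

    # row-wise transformation with apply + lambda
    df['client_date'] = df['client_time'].apply(lambda x: x.split(' ')[0])

    # grouping and statistics
    print(df['user_name'].groupby(df['product']).count())

    # type conversion
    df['client_time'] = pd.to_datetime(df['client_time'])
    print(df.dtypes)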

<10> Data analysis and examples

The execution steps of data analysis: 1. data query; 2. data processing; 3. metric calculation; 4. data visualization; 5. pivot table.
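A minimal end-to-end sketch of these steps in pandas; the hive query line is commented out because the table name would be an assumption, and a made-up dataframe stands in for the query result:

    import pandas as pd

    # 1. data query (normally via myutil.query_hive / lazy_query; made-up data used here)
    # df = myutil.lazy_query("select product, client_date, pvs from <your_table>")
    df = pd.DataFrame({
        'product': ['a', 'a', 'b', 'b'],
        'client_date': ['2020-04-01', '2020-04-02', '2020-04-01', '2020-04-02'],
        'pvs': [10, 20, 30, 40],
    })

    # 2. data processing
    df['client_date'] = pd.to_datetime(df['client_date'])

    # 3. metric calculation
    daily_pv = df.groupby('client_date')['pvs'].sum()

    # 4. data visualization
    daily_pv.plot.line()

    # 5. pivot table
    pivot = pd.pivot_table(df, index='client_date', columns='product',
                           values='pvs', aggfunc='sum')
    print(pivot)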

<11> Data visualization and examples

Steps of data visualization: 1) data processing and understanding; 2) chart selection, deciding which charts best suit the visualization; 3) chart creation and attribute setting (attributes here include the title, axes, label properties, and subplot settings).

Common charts and how to realize them:
- Line chart: the index of the dataframe, or a non-duplicated dimension column, is the x-axis; a measure column is the y-axis.
- Bar chart: same as the line chart, with a dimension on the x-axis and a measure on the y-axis.
- Pie chart: the dataframe is analyzed as proportions according to the chosen measure.
- Scatter plot: select two measures from the dataframe and set them as the x and y parameters.
- Histogram: shows the frequency distribution (or probability density) of a measure.
- Area plot (1+ measures, 1+ dimensions): like a line chart but with fill; the measures should be all positive or all negative.
- Box plot (1+ measures): computes the distribution of measures in the dataframe, with measures on the x-axis and values on the y-axis.

Supported extension packages: matplotlib, pandas, seaborn, plotly, cufflinks.
   # Common diagrams and implementations
   # See the accompanying notebook demo for examples of the common diagrams.
   # The examples below also let you compare the implementation complexity across the different extension packages.
   
   # line
   import matplotlib.pyplot as plt
   from matplotlib.font_manager import FontProperties
   zhfont1 = FontProperties(fname="/home/public/font/SimHei.ttf", size=14) 
   plt.title("product uvs") 
   plt.xlabel("Date of visit", fontproperties=zhfont1)  
   plt.xticks(rotation=30)
   plt.ylabel("uv") 
   plt.plot(uv_df) # DataFrame Graph
   plt.show()
   

   # pandas code:

   # pandas plotting is object-oriented: call the appropriate chart method on the dataframe object
   import pandas as pd
   import numpy as np
   data = pd.DataFrame(
       np.random.randn(100, 4),
       index=np.arange(100),
       columns=list("ABCD"))# API supports object understanding (declarative drawing, object oriented)
   #data.plot.line() # draw a line chart (index or non-repeating dimension column as x axis, measure column as y axis)
   #data.plot()
   #data[['A']].plot(kind='line')
   data[['A']].plot.line()

   # seaborn
   import matplotlib.pyplot as plt
   import seaborn as sn
   import pandas as pd
   sn.lineplot( data=uv_df)
   plt.xticks(rotation=90) # Horizontal rotation to avoid axis stacking
   plt.show()
   

   # cufflinks code:

   import pandas as pd
   import cufflinks as cf
   import numpy as np

   cf.set_config_file(offline=True)

   # specified Dataframe
   cf_data = pd.DataFrame(
       np.random.randn(100, 4),
       index=np.arange(100),
       columns=list("ABCD")
       )
   labels = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh']
   cf_data['label'] = np.random.choice(labels,100)

   # Select Figure and plot the chart using iplot
   cf_data.iplot(kind='line',xTitle='x',yTitle='y',title='Cufflinks Sample diagram ')
   

   # bar chart
   # matplotlib/pandas
   # Supports plotting directly from the dataframe: puv_df.plot.bar(); plt.bar() can also be used

   # seaborn
   
   import matplotlib.pyplot as plt
   import seaborn as sn
   import pandas as pd
   sn.barplot(op_df.index,op_df.pvs)
   plt.xticks(rotation=90) # Horizontal rotation to avoid axis stacking
   plt.show()
   

   #cufflinks

   
   cf_data.iplot(kind='bar')
   

   # the pie chart
   #matplotlib

   import matplotlib.pyplot as plt
   plt.pie(x=op_df.pvs, labels=op_df.index)
   plt.show()
   

   # a scatter diagram
   # matplotlib/Pandas
   
   # Scatter plot (select two measures x, y from the dataframe)
   data.plot.scatter(x='A',y='B',color='LightGreen')
   

   # seaborn
   # cufflinks

   # Draw a scatter plot
   cf_data.iplot(kind='scatter',x='A',y='B',mode='markers')

   # histogram
   # Histogram can intuitively show frequency, or probability density,
   # In this example, we count each user's number of visits to the platform within a period and display it as a histogram; the result shows that user visit counts follow a long-tail distribution

   # matplotlib
   import matplotlib.pyplot as plt
   import pandas as pd
   plt.hist(fre_df)
   plt.show()  # the graph shows the long-tail distribution of user visits
   

   # seaborn
   import matplotlib.pyplot as plt
   import seaborn as sn
   import pandas as pd
   sn.distplot(fre_df, kde=False)  # seaborn histogram; with kde=False it is basically the same as the matplotlib histogram above
   sn.distplot(fre_df)  # with kde=True (the default), the kernel density estimate is overlaid, which helps estimate the probability density
   plt.show()
   
   
   # heat map
   # Heat maps are often used to look at correlations between different variables.
   # Here we show the correlation coefficient between daily PV and UV with a heat map; seaborn example code:
   import matplotlib.pyplot as plt
   import seaborn as sn
   import pandas as pd
   sn.heatmap(puv_df.corr(),vmin=0.5,vmax=1)
   plt.show()
   

<12> Data modeling and examples

Step 1: problem definition. Defining the analysis problem and choosing an appropriate method is the most important part.
Step 2: method and tool selection. 1) Define the model and confirm the modeling approach according to the defined problem: supervised learning or unsupervised learning. Supervised learning: common model methods include classification, regression, and so on. Unsupervised learning: common methods include clustering, dimensionality reduction, and so on. 2) Determine the data volume and the applicable tools. For model training and prediction with a small sample size, Python extension packages such as sklearn are enough: 1. use the sklearn extension package directly, see the [sklearn official documentation (Chinese)](https://sklearn.apachecn.org/); 2. do model training and model evaluation with the k library encapsulated by the company. Model training and prediction with a large sample size require distributed processing, and the platform provides the following methods: 1. model training and prediction with PySpark, example [PySpark performs LR model training]; 2. submitting a distributed training job with pyXlearning, example [pyXlearning runs an Xgboost task].
Step 3: the execution steps of modeling: 1) data reading: myutil, PyHive, etc.; 2) data preprocessing; 3) feature processing (extraction); 4) modeling and prediction: here we use the online diabetes dataset and the sklearn extension package for regression prediction. The regression coefficients (the weight of each feature) are computed, and the model is evaluated with the MSE metric.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn import datasets, linear_model  
    # linear_model is the linear model module
    from sklearn.metrics import mean_squared_error, r2_score
    # mean_squared_error and r2_score are two metrics for evaluating the model: mean squared error and R squared respectively

    # Load the diabetes dataset
    diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)

    # Use only one feature
    diabetes_X = diabetes_X[:, np.newaxis, 2]

    # Split the data into training/testing sets
    diabetes_X_train = diabetes_X[:-20]  # training set
    diabetes_X_test = diabetes_X[-20:] # test set

    # Split the targets into training/testing sets
    diabetes_y_train = diabetes_y[:-20]  # training set
    diabetes_y_test = diabetes_y[-20:]   # test set

    # Create linear regression object
    regr = linear_model.LinearRegression()  # Linear regression

    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)  # Start training

    # Make predictions using the testing set
    diabetes_y_pred = regr.predict(diabetes_X_test) # Start predicting

    # The coefficients
    print('Coefficients: \n', regr.coef_)
    # The mean squared error, calculate mSE
    print('Mean squared error: %.2f'
          % mean_squared_error(diabetes_y_test, diabetes_y_pred))
    # The coefficient of determination: 1 is perfect prediction
    print('Coefficient of determination: %.2f'
          % r2_score(diabetes_y_test, diabetes_y_pred))  # compute r squared

    # Plot the outputs
    plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
    plt.plot(diabetes_X_test, diabetes_y_pred, color='blue', linewidth=3)

    plt.xticks(())
    plt.yticks(())

    plt.show()
5) Model evaluation: different models have corresponding evaluation metrics. For classification models, accuracy, recall, AUC, and F1 score are commonly used to evaluate the model; for regression models, MSE, R squared, and other evaluation methods are commonly used.
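A minimal sketch of computing these metrics with sklearn.metrics, using tiny made-up labels and predictions:

    from sklearn.metrics import (accuracy_score, recall_score, f1_score,
                                 roc_auc_score, mean_squared_error, r2_score)

    # classification metrics (made-up labels, predictions, and positive-class scores)
    y_true = [0, 1, 1, 0, 1]
    y_pred = [0, 1, 0, 0, 1]
    y_score = [0.1, 0.9, 0.4, 0.2, 0.8]
    print("accuracy:", accuracy_score(y_true, y_pred))
    print("recall:", recall_score(y_true, y_pred))
    print("f1:", f1_score(y_true, y_pred))
    print("auc:", roc_auc_score(y_true, y_score))

    # regression metrics (made-up targets and predictions)
    y_true_reg = [3.0, 2.5, 4.0, 5.5]
    y_pred_reg = [2.8, 2.7, 4.2, 5.0]
    print("mse:", mean_squared_error(y_true_reg, y_pred_reg))
    print("r2:", r2_score(y_true_reg, y_pred_reg))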

<13> Data report preparation and sharing

I. Notebook reporting capabilities

1. Markdown and code mixed together: the Notebook not only provides code writing and running, but also supports rich Markdown text. Complete report-writing capability comes from combining code input, result output (including returned results, visual charts, etc.), and the related explanatory Markdown. Markdown supports titles, lists, links, pictures, formulas, and other text forms; in other words, each cell in the notebook can hold not only code but also plain text (raw) and Markdown.
2. Slides settings and publishing (publishing is currently not supported by the Notebook): choose Slideshow from the View - Cell Toolbar menu and set the slide type for each cell (slide, sub-slide, or fragment); cells of other types are not shown in the presentation. Publishing slides uses the nbconvert service (the current cloud Jupyter mode does not support it).
3. Export as an offline file: after the notebook is finished, both the code and the report can be exported as an offline file in HTML, TOC, or PDF format and then shared or presented. In other words: the export is the report.

II. How to share the Notebook report

1. Native sharing capability: export the notebook to generate an offline file and share that file.
2. Customized sharing function: during the localization of the Notebook we did custom development for sharing requirements, supporting file sharing, share settings, share lists, one-click cloning, and other functions; see the [share function instructions](http://zt.corp.kuaishou.com/user/document/detail?serviceId=103&portalType=1&docId=2167#_6).
3. Synchronizing to the wiki: for better knowledge sharing and retrieval, the Notebook provides a synchronize-to-wiki feature. Steps: 1) click the Synchronize Wiki button on the toolbar; 2) select the space to import into; both personal space and team space can be imported (note: wiki authentication uses the Jupyter account to import documents; to add a new team space to the team-space list, please submit a request through Jupyter feedback); 3) after the import succeeds you receive a Kim notification and a Jupyter platform notification (the import link appears in the upper right corner of the notebook).

<14> Collaborative development with Notebook

Collaborative development here includes sharing and cloning. The Notebook computing platform originally developed the sharing feature for notebook review and for sharing output reports, and has since provided the one-click cloning feature for secondary development.

<15> Automatic code completion

Currently, in Lab development mode, type "pd." and then press TAB to display automatic code completion (the initial startup may take a little longer).

3/ Platform advanced usage

<1> PySpark big data processing

The Notebook computing platform offers two kinds of computing capability: local computing (10G, 20G, and 50G packages are available) and distributed computing (submitting Spark jobs to the Spark engine for computation from the python kernel via the integrated myutil methods; this is the newer approach and the one you are advised to use). The typical workflow: 1. create a Spark session using myutil; 2. read Hive data with Spark; 3. process the Spark dataframe locally; 4. submit a Spark ML job. A sketch of steps 1-3 follows.
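A minimal sketch of steps 1-3, assuming the myutil get_sparksession method listed earlier (its exact signature is an assumption) and a hypothetical table name; step 4 (submitting a Spark ML job) depends on platform-specific job configuration and is not shown:

    import myutil

    # 1. create a spark session via myutil
    spark = myutil.get_sparksession()

    # 2. read hive data with spark (hypothetical table name)
    sdf = spark.sql("select product, user_name, pvs from ks_hdp_dev.some_demo_table")

    # distributed processing on the spark dataframe
    agg = sdf.groupBy("product").count()

    # 3. bring the (small) aggregated result back locally as a pandas dataframe
    pdf = agg.toPandas()
    print(pdf.head())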