Introduction to Jupyter Notebook

Original article: www.dataquest.io/blog/jupyte…

Jupyter Notebook is a popular tool for interactively developing and presenting data science projects. A notebook serves as a single document that combines code, explanatory text, code output, mathematical formulas, and more.

This article introduces Jupyter Notebook through a simple data analysis example. The example uses data on the Fortune 500 companies in the United States, covering the 50-plus years since the list was first published in 1955. The task is to analyze how the profits of these companies have changed over time.

1. Installation

Anaconda is one of the most widely used environment management tools, and it comes with many common third-party libraries preinstalled, including NumPy, pandas, and Matplotlib.

For more information on Anaconda, see the introduction and environment setup sections of the Python Basics article published on the official account.

Besides installing through Anaconda, you can also install Jupyter directly with pip:

pip install jupyter

2. Create your first Notebook

This section describes how to run and save notebooks and introduces the structure and interactive interface of Jupyter Notebook. We will use an example to become familiar with some core concepts and get a better understanding of how to use Jupyter Notebook.

Run Jupyter

On Windows, you can run Jupyter from the shortcut added to the Start menu, or start it by typing jupyter notebook on the command line. A new window opens in the default browser, as follows:

This is not a notebook yet. It is the notebook dashboard, the management interface for all notebooks in the current folder.

Note that the dashboard only shows the files and folders inside the folder where Jupyter was started, that is, the working directory of the command line when the jupyter notebook command was run. You can change this by passing the folder location to the command, i.e.:

jupyter notebook filepath

Also note that the URL in the browser looks like http://localhost:8888/tree, where localhost is the address of the local machine and 8888 is the port.

The next step is to create a new notebook, as shown in the following image. In the upper right corner of the dashboard, click the New menu and select Python 3 (or another version). The default name is Untitled.ipynb.

The ipynb file

Each ipynb file describes the contents of a notebook in JSON format, including every cell and its contents, together with metadata. After opening a notebook, choose Edit->Edit Notebook Metadata to edit the metadata.
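
To make the JSON structure concrete, here is a minimal sketch that builds a tiny notebook as a Python dictionary, serializes it the way an ipynb file is stored, and reads the cells back. The top-level keys follow the version 4 notebook format; the cell contents and the kernel name are made up for illustration.

```python
import json

# A minimal notebook in the version 4 layout: one Markdown cell, one code cell.
# The cell contents are invented for this example.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {"kernelspec": {"name": "python3", "display_name": "Python 3"}},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Hello"]},
        {
            "cell_type": "code",
            "execution_count": None,  # the cell has not been run yet
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        },
    ],
}

# An ipynb file is this structure serialized as JSON text.
text = json.dumps(notebook, indent=1)

# Reading it back gives access to every cell and to the metadata.
loaded = json.loads(text)
cell_types = [cell["cell_type"] for cell in loaded["cells"]]
print(cell_types)
print(loaded["metadata"]["kernelspec"]["name"])
```

Because the file is plain JSON, it can be inspected with any text editor, although editing it through the notebook interface is usually safer.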

Notebook interface

The new Notebook interface is shown in the following figure.

There are two important terms to know here, cells and kernels:

kernel: the computing engine that executes the code contained in the notebook
cell: a container block that holds text to display or code to execute

Cells

Cells form the main body of a notebook. There are two main types of cell:

code cell: contains code to be executed and displays its output below
Markdown cell: contains text in Markdown format and displays the rendered result when run

To execute a cell, click the Run button. In the example, a Markdown cell is rendered as formatted text when run; of the two code cells, the first imports a third-party library and the second prints a sentence, with the output shown below the cell.

Note that a label such as In [1] appears to the left of a code cell, where the number indicates the order in which the cell was run on the kernel. A cell that has not been executed shows In [ ]. The first cell run shows In [1], the next In [2], and so on; if the same cell is executed again, its number changes too. In is short for Input. If a cell takes a while to run, it shows In [*] to indicate that it is currently running.

In a notebook, you can display the value of a variable or the return value of a function without calling print, as shown below. Only the value of the last line of the current cell is displayed this way.

Another thing to note is that a cell's border is green while it is being edited and blue while it is selected but not being edited. These are the two modes: green for edit mode and blue for command mode.

Shortcuts

Notebook has many keyboard shortcuts, which can be viewed from the Help->Keyboard Shortcuts menu or by opening the command palette with Ctrl+Shift+P. Here are some of them:

Switch between edit mode and command mode with Esc and Enter: press Esc to enter command mode and Enter to enter edit mode.

In command mode:

Move between cells with the Up and Down arrow keys
A inserts a new cell above the current cell, and B inserts a new cell below it
M changes the current cell to a Markdown cell, and Y changes it to a code cell
Pressing D twice in a row deletes the current cell
Z undoes a cell deletion
Hold Shift and press Up or Down to select multiple cells at once; Shift + M then merges the selected cells

Markdown

Markdown is a lightweight, easy-to-learn markup language for formatting text. Its syntax corresponds closely to HTML tags, and it is very handy for adding commentary or images.

Try typing the following text into a Markdown cell in Jupyter Notebook:

# This is a level 1 heading
## This is a level 2 heading
This is some plain text that forms a paragraph.
Add emphasis via **bold** and __bold__, or *italic* and _italic_.

Paragraphs must be separated by an empty line.

* Sometimes we want to include lists.
  * Which can be indented.

1. Lists can also be numbered.
2. For ordered lists.

[It is possible to include hyperlinks](https://www.example.com)

Inline code uses single backticks: `foo()`, and code blocks use triple backticks:

```
bar()
```

Or can be indented by 4 spaces:

    foo()

And finally, adding images is easy: ![Alt text](https://www.example.com/image.jpg)

The result is shown below:

If you want to add images, there are three ways to do it:

Use an image on the web by giving its URL, as in the example above, where the URL is www.example.com/image.jpg
Use a local URL to an image stored alongside the notebook, for example in the same Git repository
Choose Edit->Insert Image from the menu bar, which converts the image to a string and stores it inside the .ipynb file. Note that this increases the size of the .ipynb file

A guide to Markdown can be found on the site of its creator, John Gruber:

daringfireball.net/projects/ma…

Kernels

Each notebook has a kernel. When you run a code cell, the code is executed in the kernel and any output is returned to the cell for display. The kernel's state persists over time and across cells, so variables defined or libraries imported in one cell are also available in other cells; cells are not independent of each other.

In a way, a notebook can be thought of as a script file, except that it supports richer content such as headings, images, and so on.

A code example illustrates this kernel behavior, as shown in the figure below. Enter two pieces of code in two separate cells: the first cell imports NumPy and defines the function square(), and the second cell calls square() and runs successfully.
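
The two cells described above might look like the following sketch. The cell boundaries are marked with comments, and the variable names x and y as well as the printed message are assumptions, since only the import and the square() function are named in the text.

```python
# --- Cell 1: import a library and define a function ---
import numpy as np

def square(x):
    """Return x multiplied by itself."""
    return x * x

# --- Cell 2: run separately, yet np and square() are still available ---
x = np.random.randint(1, 10)  # random integer from 1 to 9
y = square(x)
print('%d squared is %d' % (x, y))
```

In a notebook, the second cell works only after the first one has been run in the same kernel session, because that is when np and square() are added to the kernel's state.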

Most of the time we run cells from top to bottom, but that is not always the case. We can go back to any cell and run it again, which is why the In [ ] label to the left of each cell is useful: it tells us the order in which the cells were run.

In addition, we can restart the kernel. The Kernel menu offers these options:

Restart: restarts the kernel, which clears all variables defined in the notebook
Restart & Clear Output: same as the first option, but also clears all cell output
Restart & Run All: restarts the kernel, then automatically runs all cells from the top

If the kernel gets stuck on a piece of code and you want to stop it, use the Interrupt option.

Select a kernel

The Kernel menu also provides an option to change the kernel. You select a kernel when you create a notebook; which kernels you can choose from depends on which Python versions are installed. For example, if both Python 3.6 and Python 2.7 are installed, there are two options. Besides Python, Jupyter Notebook supports kernels for more than 100 other languages, such as Java, C, R, and Julia.

3. Examples of data analysis

We now turn to the data analysis example mentioned at the beginning of the article: analyzing how company profits changed over time using the Fortune 500 data.

Naming the notebook

After opening the notebook, click the file name at the top of the screen to rename it; an unnamed notebook is called Untitled.ipynb. You can also go back to the dashboard and rename the notebook there, as shown below. After you select a notebook, a row of options appears above the list, including:

Duplicate: copies the notebook
Shutdown: stops the kernel of the notebook
View: views the notebook content
Edit: edits the metadata content

There is also an option to delete files.

Note that closing the notebook tab does not shut down the notebook's kernel, which keeps running in the background. A notebook shown in green in the dashboard is still running. To stop it, use the Shutdown option, or shut down Jupyter Notebook itself from the command line.

Preparation

Start by importing the third-party libraries you need:

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

pandas is used to work with the data, Matplotlib to draw charts, and seaborn to make the charts look better. NumPy is usually imported as well, but in this case we use it indirectly through pandas. Note that %matplotlib inline is not a Python command. It is a magic command specific to Jupyter that tells it to capture Matplotlib plots and render them in the cell output.

Then we read the data:

df = pd.read_csv('fortune500.csv')

Save and checkpoint

Before going further, remember to save your work regularly. You can do this with the Ctrl + S shortcut, which runs a command called “Save and Checkpoint”.

Every time you create a new notebook, a checkpoint file is created along with it. The checkpoint is also an .ipynb file and is stored in a hidden subfolder called .ipynb_checkpoints. By default, Jupyter automatically saves the notebook contents to the checkpoint file every 120 seconds, and both the notebook and checkpoint files are updated when you save manually. The checkpoint therefore lets you recover your work if the notebook shuts down unexpectedly; to do so, choose File->Revert to Checkpoint.

Exploring the data set

The data has been loaded into df, an instance of the pandas data structure called a DataFrame, which looks like a table. Let's display its first five and last five rows:

df.head()
df.tail()

The output is as follows:

Looking at the output, we see that each row holds one company's data for a given year, with five columns representing year, rank, company name, revenue, and profit.

Next, we rename the columns for convenience:

df.columns = ['year', 'rank', 'company', 'revenue', 'profit']

Then, you can also view the amount of data, as shown below:

len(df)

The result is 25,500 rows, which is exactly 500 companies for each of the 51 years from 1955 to 2005.

Next, we check that the data set was imported as expected. A simple check is to see whether the data type of each column was interpreted correctly.
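
A minimal sketch of this check uses the DataFrame's dtypes attribute. The small table below is a hypothetical stand-in for the real fortune500.csv data, with an invented company name and values, so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the real data: profit holds a non-numeric "N.A."
df = pd.DataFrame({
    "year": [1955, 1955, 1956],
    "rank": [1, 2, 1],
    "company": ["A Corp", "B Corp", "A Corp"],
    "revenue": [9823.5, 5661.4, 12443.3],
    "profit": ["806", "N.A.", "1189.5"],
})

# dtypes lists the data type pandas inferred for each column
print(df.dtypes)

# The same check can be made programmatically per column
revenue_is_float = pd.api.types.is_float_dtype(df["revenue"])
profit_is_numeric = pd.api.types.is_numeric_dtype(df["profit"])
print(revenue_is_float, profit_is_numeric)
```

A column that pandas could not parse as numbers shows up with a non-numeric dtype (object in many pandas versions), which is the signal to look for.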

The profit column has dtype object instead of float64 like the revenue column. This means it probably contains some non-numeric values, so we need to check.
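
One way to find such values, sketched here on hypothetical sample data, is a regular expression that flags any entry containing a character other than a digit, a decimal point, or a minus sign. The name non_numeric_profits is an assumption made for this example.

```python
import pandas as pd

# Hypothetical sample: profit values read in as strings
df = pd.DataFrame({
    "year": [1955, 1955, 1956],
    "profit": ["806", "N.A.", "-12.5"],
})

# True for every entry containing a character that is not 0-9, "." or "-"
non_numeric_profits = df["profit"].str.contains("[^0-9.-]")

# Show only the rows whose profit is not a plain number
print(df.loc[non_numeric_profits])
```

The resulting Boolean mask can be reused in later steps to count, inspect, or drop the affected rows.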

The output shows that some values are indeed not numbers but the string N.A. Next, we need to determine whether any other kinds of non-numeric values are present.

The output shows that N.A. is the only such value, so how to handle these missing values depends on how many rows are affected.
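
Both questions (which distinct non-numeric values occur, and how many rows they affect) can be answered from the same Boolean mask. The sketch below uses invented sample values rather than the real data set.

```python
import pandas as pd

# Hypothetical sample of the profit column
profit = pd.Series(["806", "N.A.", "1189.5", "N.A.", "-12.3"])
non_numeric = profit.str.contains("[^0-9.-]")

# Distinct non-numeric values that occur in the column
distinct_values = set(profit[non_numeric])

# Number of rows that hold a non-numeric value
missing_count = len(profit[non_numeric])

print(distinct_values, missing_count)
```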

369 rows have missing profit values, about 1.5% of the 25,500 rows. If the missing values are spread roughly evenly over the years, the simplest solution is to delete those rows, so we look at their distribution by year.
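
A simple way to inspect the distribution is to count the affected rows per year. This sketch uses value_counts on hypothetical sample data instead of a plotted histogram, so it runs without a plotting backend.

```python
import pandas as pd

# Hypothetical sample with missing profits in some years
df = pd.DataFrame({
    "year": [1955, 1955, 1956, 1956, 1957],
    "profit": ["806", "N.A.", "N.A.", "N.A.", "45.1"],
})
non_numeric_profits = df["profit"].str.contains("[^0-9.-]")

# Count how many missing values fall in each year, ordered by year
missing_per_year = df.loc[non_numeric_profits, "year"].value_counts().sort_index()
print(missing_per_year)
```

Years without any missing value do not appear in the result at all, which is worth keeping in mind when reading the counts.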

According to the results, the worst year has fewer than 25 missing values, which is under 5% of the 500 rows for that year. Only in the 1990s does the number of missing values exceed 20; in the remaining years it stays below 10. It is therefore acceptable to delete the rows with missing values.
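
Deleting the affected rows and converting the column could look like the sketch below, again on hypothetical sample data. It keeps only the rows where the mask is False and then converts the remaining strings with pd.to_numeric.

```python
import pandas as pd

# Hypothetical sample: one row has a missing profit
df = pd.DataFrame({
    "year": [1955, 1955, 1956],
    "profit": ["806", "N.A.", "1189.5"],
})
non_numeric_profits = df["profit"].str.contains("[^0-9.-]")

# Keep only rows with numeric profit; copy to avoid modifying a view
df = df.loc[~non_numeric_profits].copy()

# Convert the remaining strings to numbers
df["profit"] = pd.to_numeric(df["profit"])

print(len(df))
print(df.dtypes)
```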

After the rows are deleted and the column is converted, profit has dtype float64.

With this simple data exploration done, we can move on to plotting.

Charting with Matplotlib

First we plot the average profit by year, and we will also plot how revenue changes over the years, as shown in the figure below:
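
A plot of this kind can be sketched by grouping the rows by year, averaging, and drawing a line chart. The sample values, the helper function plot(), and the title and axis label are assumptions made for illustration; the real data would come from the cleaned df.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical cleaned data: two companies per year
df = pd.DataFrame({
    "year": [1955, 1955, 1956, 1956],
    "revenue": [100.0, 300.0, 200.0, 400.0],
    "profit": [10.0, 30.0, 20.0, 60.0],
})

# Mean revenue and profit for each year
avgs = df.loc[:, ["year", "revenue", "profit"]].groupby("year").mean()

def plot(x, y, ax, title, y_label):
    """Draw one line chart on the given axes (hypothetical helper)."""
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)  # let the line touch the edges of the axes

fig, ax = plt.subplots()
plot(avgs.index, avgs["profit"], ax, "Mean profit by year", "Profit")
```

Calling plot() again with avgs["revenue"] on a new set of axes gives the corresponding revenue chart. In a notebook with %matplotlib inline, the figure is rendered below the cell automatically.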

The profit curve looks somewhat like exponential growth, but with two huge dips, which are related to events at the time. Next we can look at what happened to revenue, as shown in the chart below:

Revenue did not fluctuate nearly as sharply as profit during those two dips.

Reference: stackoverflow.com/a/47582329/…

The results show a very wide spread among companies: some earn billions of dollars while others lose billions.

There is a lot more to explore, but the examples so far are enough to get you started with Jupyter Notebook. This part showed how to explore data and chart it.

References

Markdown: www.markdownguide.org/getting-sta…
stackoverflow.com/a/47582329/…

Finally, to get a link to the code and data for this article, reply “Jupyter” in the official account's message window.
