Original link: https://jinkey.ai/post/tech/xiang-mian-fei-yong-gu-ge-zi-yuan-xun-lian-shen-jing-wang-luo-colab-xiang-xi-shi-yong-jiao-c. This article was written by Jinkey (WeChat public account Jinkey-love, official website https://jinkey.ai). Reproduction is permitted as long as the attribution is left intact; deleting or modifying this copyright notice is regarded as an infringement of intellectual property rights, and we reserve the right to pursue legal liability.

1 Introduction

Colab is Google's internal Jupyter Notebook-style interactive Python environment. It integrates with the Google suite (TensorFlow, BigQuery, Google Drive, etc.) and supports pip installation of any custom library. Website: https://colab.research.google.com

2 Installing and using libraries

Colab comes with deep learning libraries such as TensorFlow, Matplotlib, NumPy, and Pandas preinstalled. If you need other dependencies, such as Keras, create a new code block and type:

# Install the latest version of Keras
# https://keras.io/
!pip install keras

# Install a specific version
!pip install keras==2.0.9

# Install OpenCV
# https://opencv.org/
!apt-get -qq install -y libsm6 libxext6 && pip install -q -U opencv-python

# Install PyTorch
# http://pytorch.org/
!pip install -q http://download.pytorch.org/whl/cu75/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl torchvision

# Install XGBoost
# https://github.com/dmlc/xgboost
!pip install -q xgboost

# Install 7zip
!apt-get -qq install -y libarchive-dev && pip install -q -U libarchive

# Install GraphViz and PyDot
!apt-get -qq install -y graphviz && pip install -q pydot
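
To confirm an installation succeeded, a quick sanity check (using Keras as an example) is:

# Verify the installed version
import keras
print(keras.__version__)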

3 Google Drive file operations

Authorize and log in

For a given notebook, you only need to log in once; after that, you can read and write files freely.

# Install the PyDrive library; this only needs to be done once per notebook
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# The first login will prompt for authentication
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

After executing this code, a prompt is printed: click the link to authorize login, copy the token value into the input box, and press Enter to complete the login.

Directory traversal

# List all files in the root directory
# The "q" query syntax is documented at https://developers.google.com/drive/v2/web/search-parameters
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
  print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))

The console prints something like:

title: Colab test, id: 1cB5CHKSdL26AMXQ5xrqk2kaBv5LBkIsJ8HuEDyZpeqQ, mimeType: application/vnd.google-apps.document

title: Colab Notebooks, id: 1U9363A12345TP2nSeh2K8FzDKSsKj5Jj, mimeType: application/vnd.google-apps.folder

Here, id is the unique identifier used to retrieve a file in the rest of the tutorial. From the mimeType you can tell that "Colab test" is a Google Doc and that "Colab Notebooks" is a folder (the root directory where Colab notebooks are stored). To query the files inside the Colab Notebooks folder, write the query condition as follows:

# 'directory id' in parents
file_list = drive.ListFile({'q': "'1cB5CHKSdL26AMXQ5xrqk2kaBv5LBkIsJ8HuEDyZpeqQ' in parents and trashed=false"}).GetList()

Reading file contents

At present, the only format that can be read directly as a string is .txt (mimeType: text/plain). The reading code is as follows:

file = drive.CreateFile({'id': "replace with your .txt file id"})
file.GetContentString()

For a .csv file, however, GetContentString() only returns the first line of data, so download the file with GetContentFile() instead:

file = drive.CreateFile({'id': "replace with your .csv file id"})
# The download here is only a cache; it will not create another copy in your Google Drive
file.GetContentFile('iris.csv', "text/csv")

# Print the file contents directly
with open('iris.csv') as f:
  print(f.readlines())

# Or read it with pandas
import pandas as pd
pd.read_csv('iris.csv', index_col=[0, 1], skipinitialspace=True)

Colab renders the output as a table (the screenshot showed the first few rows of the iris dataset). The iris dataset is available at http://aima.cs.berkeley.edu/data/iris.csv; you can upload it to your own Google Drive to follow along, for example with the sketch below.
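
If you prefer to script the upload, here is a minimal sketch using the PyDrive client from the authorization step. It assumes iris.csv has already been saved into the Colab working directory:

# Upload a local iris.csv from the working directory to Google Drive
uploaded = drive.CreateFile({'title': 'iris.csv'})
uploaded.SetContentFile('iris.csv')
uploaded.Upload()
print('Uploaded file with id {}'.format(uploaded.get('id')))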

Writing files

# Create a text file
uploaded = drive.CreateFile({'title': 'example.txt'})
uploaded.SetContentString('Test content')
uploaded.Upload()
print('Created file with id {}'.format(uploaded.get('id')))

More actions are available at http://pythonhosted.org/PyDrive/filemanagement.html
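
As a taste of what that page covers, here is a minimal sketch of updating one file and deleting another; the ids are placeholders you must replace:

# Overwrite the content of an existing file
existing = drive.CreateFile({'id': "replace with an existing file id"})
existing.SetContentString('New content')
existing.Upload()

# Permanently delete a file
obsolete = drive.CreateFile({'id': "replace with a file id to delete"})
obsolete.Delete()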

4 Google Sheets spreadsheet operations

Authorize and log in

For a given notebook, you only need to log in once; after that, you can read and write spreadsheets freely.

!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

Reading

For the demo, import the iris.csv data into a Google Sheet named iris; the file can be placed in any directory of your Google Drive.

worksheet = gc.open('iris').sheet1

# Get a list of rows:
# [[row 1 col 1, row 1 col 2, ..., row 1 col n], ..., [row n col 1, row n col 2, ..., row n col n]]
rows = worksheet.get_all_values()
print(rows)

# Read into pandas
import pandas as pd
pd.DataFrame.from_records(rows)

The printed results are, respectively:

[['5.1', '3.5', '1.4', '0.2', 'setosa'], ['4.9', '3', '1.4', '0.2', 'setosa'], ...
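
Since this sheet has no header row, you can supply column names yourself when building the DataFrame; the names below are my own choice for the iris fields:

import pandas as pd

# Name the five iris columns explicitly
df = pd.DataFrame.from_records(rows, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])
print(df.head())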

Writing

# Create a new spreadsheet
sh = gc.create('Google Table')

# Open the workbook and worksheet
worksheet = gc.open('Google Table').sheet1
cell_list = worksheet.range('A1:C2')

# Fill the selected cells with random integers
import random
for cell in cell_list:
  cell.value = random.randint(1, 10)
worksheet.update_cells(cell_list)
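
gspread can also address single cells; a minimal sketch using its update_acell call:

# Write one cell using A1-style addressing
worksheet.update_acell('D1', 'done')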

5 Download the file to the local PC

from google.colab import files

with open('example.txt', 'w') as f:
  f.write('Test content')

files.download('example.txt')
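
google.colab.files also works in the other direction; a minimal sketch of pulling a file from your local machine into the working environment:

from google.colab import files

# Opens a file picker in the browser and returns {filename: file contents}
uploaded = files.upload()
for name in uploaded:
  print('Uploaded {} ({} bytes)'.format(name, len(uploaded[name])))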

6 A hands-on example

Here I use my open-source LSTM text classification project as an example: https://github.com/Jinkeycode/keras_lstm_chinese_document_classification. Store the three files from its data/ directory on Google Drive. The example classifies article titles into the health, technology, and design categories.

Create a notebook

Create a new Python 2 notebook on Colab.

Install dependencies

!pip install keras
!pip install jieba
!pip install h5py

import h5py
import jieba as jb
import numpy as np
import keras as krs
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

Load the data

Authorize and log in

# Install the PyDrive library; this only needs to be done once per notebook
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def login_google_drive():
  # The first run will prompt for authentication
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  return drive

List all files in Google Drive

def list_file(drive):
  file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
  for file1 in file_list:
    print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))
    

drive = login_google_drive()
list_file(drive)

Cache the data into the working environment

def cache_data():
  # Replace these ids with the corresponding file ids listed in the previous step
  health_txt = drive.CreateFile({'id': "117GkBtuuBP3wVjES0X0L4wVF5rp5Cewi"})
  tech_txt = drive.CreateFile({'id': "14sDl4520Tpo1MLPydjNBoq-QjqOKk9t6"})
  design_txt = drive.CreateFile({'id': "1J4lndcsjUb8_VfqPcfsDeOoB21bOLea3"})
  # The download here is only a cache; it will not create another copy in your Google Drive
  health_txt.GetContentFile('health.txt', "text/plain")
  tech_txt.GetContentFile('tech.txt', "text/plain")
  design_txt.GetContentFile('design.txt', "text/plain")

  print("Cache successful")

cache_data()

Read data from the working environment

def load_data():
    titles = []
    print("Loading health category data...")
    with open("health.txt"."r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("Loading tech data...")
    with open("tech.txt"."r") as f:
        for line in f.readlines():
            titles.append(line.strip())


    print("Loading design category data...")
    with open("design.txt"."r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("There are %s titles loaded." % len(titles))

    return titles
  
titles = load_data()

Load the labels

def load_label():
    # The three corpora contain 12000, 12000 and 7318 titles respectively;
    # build one label per title (0 = health, 1 = tech, 2 = design)
    arr0 = np.zeros(shape=[12000, ])
    arr1 = np.ones(shape=[12000, ])
    arr2 = np.array([2]).repeat(7318)
    target = np.hstack([arr0, arr1, arr2])
    print("There are %s tags loaded." % target.shape)

    encoder = LabelEncoder()
    encoder.fit(target)
    encoded_target = encoder.transform(target)
    dummy_target = krs.utils.np_utils.to_categorical(encoded_target)

    return dummy_target
  
target = load_label()

Text preprocessing

max_sequence_length = 30
embedding_size = 50

# Segment the titles with jieba
titles = [" ".join(jb.cut(t, cut_all=True)) for t in titles]

# Build the vocabulary and map each title to a fixed-length sequence of word ids
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_sequence_length, min_frequency=1)
text_processed = np.array(list(vocab_processor.fit_transform(titles)))

# Inspect the word-to-id mapping
vocab_dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(vocab_dict.items(), key=lambda x: x[1])
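
For intuition, here is a toy run of VocabularyProcessor on a hypothetical two-title corpus; the exact ids depend on the order words are first seen, but the fixed-length, zero-padded output is the point:

toy_titles = ["the cat sat", "the dog sat"]
toy_vp = tf.contrib.learn.preprocessing.VocabularyProcessor(4)
# Each title becomes a fixed-length row of word ids, padded with 0
print(np.array(list(toy_vp.fit_transform(toy_titles))))
# e.g. [[1 2 3 0]
#       [1 4 3 0]]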

Building a neural network

An Embedding layer and an LSTM layer form the first two layers, and the output is activated by softmax.

# Configure the network structure
def build_netword(num_vocabs):
    model = krs.Sequential()
    model.add(krs.layers.Embedding(num_vocabs, embedding_size, input_length=max_sequence_length))
    model.add(krs.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2))
    model.add(krs.layers.Dense(3))
    model.add(krs.layers.Activation("softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

    return model
  
num_vocabs = len(vocab_dict)
model = build_netword(num_vocabs=num_vocabs)

import time
start = time.time()
# Training model
model.fit(text_processed, target, batch_size=512, epochs=10)
finish = time.time()
print("Training time: %f seconds" %(finish-start))
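
Because the Colab working environment is ephemeral, it is worth persisting the trained weights. A minimal sketch that saves the Keras model locally and pushes it to Google Drive with the PyDrive client from earlier (the file name is my own choice):

# Save the trained model locally (requires h5py, installed above)
model.save('lstm_classifier.h5')

# Upload it to Google Drive so it survives the session
saved = drive.CreateFile({'title': 'lstm_classifier.h5'})
saved.SetContentFile('lstm_classifier.h5')
saved.Upload()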

Predicting samples

Replace sen with your own sentence. The prediction result is [probability of health, probability of technology, probability of design]; the title is assigned to the category with the highest probability, but if the maximum probability is below 0.8 the article is judged unclassifiable.

sen = "Tips for good business Design."
sen_prosessed = "".join(jb.cut(sen, cut_all=True)) sen_prosessed = vocab_processor.transform([sen_prosessed]) sen_prosessed = np.array(list(sen_prosessed)) Result = model.predict(sen_prosessed) catalogue = list(result[0]).index(Max (result[0])) threshold=0.8if max(result[0]) > threshold:
    if catalogue == 0:
        print("This is an article about health.")
    elif catalogue == 1:
        print("This is an article about technology.")
    elif catalogue == 2:
        print("This is an article about design.")
else:
    print("This article has no credible classification.")