Original link: https://jinkey.ai/post/tech/xiang-mian-fei-yong-gu-ge-zi-yuan-xun-lian-shen-jing-wang-luo-colab-xiang-xi-shi-yong-jiao-c This article was written by Jinkey (WeChat public account Jinkey-love, official website https://jinkey.ai). Reproduction is permitted provided the attribution is not altered. Deleting or modifying this copyright notice is regarded as an infringement of intellectual property rights, and we reserve the right to pursue legal liability.
1 Introduction
Colab is Google's internal Jupyter Notebook-style interactive Python environment. It integrates with the Google product family (TensorFlow, BigQuery, Google Drive, etc.) and supports pip installation of any custom library. Website: https://colab.research.google.com
2 Installation and use of libraries
Colab comes with basic deep-learning libraries such as TensorFlow, Matplotlib, NumPy, and Pandas. If you need other dependencies, such as Keras, create a new code block and type:
# Install the latest version of Keras
# https://keras.io/
!pip install keras

# Install a specific version
!pip install keras==2.0.9

# Install OpenCV
# https://opencv.org/
!apt-get -qq install -y libsm6 libxext6 && pip install -q -U opencv-python

# Install Pytorch
# http://pytorch.org/
!pip install -q http://download.pytorch.org/whl/cu75/torch-0.2.0.post3-cp27-cp27mu-manylinux1_x86_64.whl torchvision

# Install XGBoost
# https://github.com/dmlc/xgboost
!pip install -q xgboost

# Install 7zip support
!apt-get -qq install -y libarchive-dev && pip install -q -U libarchive

# Install GraphViz and PyDot
!apt-get -qq install -y graphviz && pip install -q pydot
3 Google Drive file operations
Authorize and log in
For the same notebook, you only need to log in once; after that you can perform read and write operations.
# Install the PyDrive library. This only needs to be done once per notebook
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authentication is only required on the first login
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)
After executing this code, the following output will be printed. Click the link to authorize the login, copy the obtained verification code into the input box, and press Enter to complete the login.
Directory traversal
List all files in the root directory
# "q" tutorial can be found in the query condition: https://developers.google.com/drive/v2/web/search-parameters
file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
for file1 in file_list:
    print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))
You will see output like the following printed to the console:
title: Colab test, id: 1cB5CHKSdL26AMXQ5xrqk2kaBv5LBkIsJ8HuEDyZpeqQ, mimeType: application/vnd.google-apps.document
title: Colab Notebooks, id: 1U9363A12345TP2nSeh2K8FzDKSsKj5Jj, mimeType: application/vnd.google-apps.folder
Here id is the unique identifier used to access the file in the rest of the tutorial, and the mimeType shows that "Colab test" is a doc document while "Colab Notebooks" is a folder (the root directory where Colab notebooks are stored). If you want to query the files inside a folder such as Colab Notebooks, the query condition can be written as follows:
# 'directory id' in parents
file_list = drive.ListFile({'q': "'1cB5CHKSdL26AMXQ5xrqk2kaBv5LBkIsJ8HuEDyZpeqQ' in parents and trashed=false"}).GetList()
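The "q" search syntax documented at the link above also lets you look files up by name rather than by parent folder. A minimal sketch under that assumption (the title 'iris.csv' is just a placeholder; use a file that actually exists in your Drive):
# Drive API v2 queries match on 'title'
file_list = drive.ListFile({'q': "title = 'iris.csv' and trashed=false"}).GetList()
for f in file_list:
    print('title: %s, id: %s' % (f['title'], f['id']))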
Reading file contents
At present, the only format whose contents can be read directly as a string is .txt (mimeType: text/plain). The reading code is as follows:
file = drive.CreateFile({'id': "Replace it with your.txt file ID"})
file.GetContentString()
For a .csv file, GetContentString() prints only the first row of data, so cache it locally with GetContentFile() instead:
file = drive.CreateFile({'id': "Replace with your .csv file id"})
# The download here only caches the file; it will not create an extra copy in your Google Drive directory
file.GetContentFile('iris.csv', "text/csv")

# Print the file contents directly
with open('iris.csv') as f:
    print(f.readlines())
# use pandas to read text
import pandas as pd
pd.read_csv('iris.csv', index_col=[0, 1], skipinitialspace=True)
Colab will display the output directly as a table (the screenshot below shows the first few rows of the iris dataset). The iris dataset is available at http://aima.cs.berkeley.edu/data/iris.csv; you can upload it to your own Google Drive to follow along.
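One way to get the dataset into your own Google Drive, sketched here as an assumption rather than part of the original walkthrough, is to download it into the Colab working directory and upload it with the same PyDrive client created above:
# Fetch the dataset into the Colab working directory
!wget -q http://aima.cs.berkeley.edu/data/iris.csv

# Upload the local file to Google Drive
uploaded = drive.CreateFile({'title': 'iris.csv'})
uploaded.SetContentFile('iris.csv')
uploaded.Upload()
print('Uploaded file with id {}'.format(uploaded.get('id')))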
Write file operation
Create a text file
uploaded = drive.CreateFile({'title': 'example.txt'})
uploaded.SetContentString('Test content')
uploaded.Upload()
print('create file id {}'.format(uploaded.get('id')))
More actions are available at http://pythonhosted.org/PyDrive/filemanagement.html
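As one example of the kind of operation covered there, updating a file that already exists in Drive follows the same CreateFile / Upload pattern, just with an id instead of a title; a rough sketch (the id string is a placeholder):
# Overwrite the contents of an existing file identified by its id
existing = drive.CreateFile({'id': "Replace with an existing file id"})
existing.SetContentString('Updated content')
existing.Upload()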
4 Google Sheets spreadsheet operations
Authorize and log in
For the same notebook, you only need to log in once; after that you can perform read and write operations.
!pip install --upgrade -q gspread

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())
Read
For the demo, import iris.csv to create a Google Sheet file named iris; it can be placed in any directory of your Google Drive.
worksheet = gc.open('iris').sheet1
# Get a list of lists:
# [[row 1 col 1, row 1 col 2, ..., row 1 col n], ..., [row n col 1, row n col 2, ..., row n col n]]
rows = worksheet.get_all_values()
print(rows)
# use pandas to read text
import pandas as pd
pd.DataFrame.from_records(rows)
The printed results are, respectively:
[['5.1', '3.5', '1.4', '0.2', 'setosa'], ['4.9', '3', '1.4', '0.2', 'setosa'], ...
Write
sh = gc.create('Google Table')
# Open the workbook and worksheet
worksheet = gc.open('Google Table').sheet1
cell_list = worksheet.range('A1:C2')
import random
for cell in cell_list:
    cell.value = random.randint(1, 10)
worksheet.update_cells(cell_list)
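gspread can also read and write individual cells; the calls below are a small illustrative sketch (the cell addresses are arbitrary):
# Write a single cell by its A1-style address
worksheet.update_acell('B1', 'hello')

# Read a single cell back by row and column index
print(worksheet.cell(1, 2).value)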
5 Download the file to the local PC
from google.colab import files

with open('example.txt', 'w') as f:
    f.write('Test content')
files.download('example.txt')
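Going the other way, files on your local PC can be pushed into the Colab working environment with the same google.colab.files module; a brief sketch:
from google.colab import files

# Opens a file picker in the browser; the chosen files are saved to the working directory
uploaded = files.upload()
for name in uploaded.keys():
    print('Uploaded file: {}'.format(name))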
6 Hands-on example
Here I take my open-source LSTM text-classification project as an example: https://github.com/Jinkeycode/keras_lstm_chinese_document_classification. Store the three files from the master/data directory on Google Drive. This example classifies titles into the health, technology, and design categories.
Create a notebook
Create a new Python2 notebook on Colab
Install dependencies
!pip install keras
!pip install jieba
!pip install h5py

import h5py
import jieba as jb
import numpy as np
import keras as krs
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
Load the data
Authorize and log in
# Install the PyDrive library. This only needs to be done once per notebook
!pip install -U -q PyDrive

from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

def login_google_drive():
    # Authentication is only required on the first login
    auth.authenticate_user()
    gauth = GoogleAuth()
    gauth.credentials = GoogleCredentials.get_application_default()
    drive = GoogleDrive(gauth)
    return drive
List all files in Google Drive
def list_file(drive):
    file_list = drive.ListFile({'q': "'root' in parents and trashed=false"}).GetList()
    for file1 in file_list:
        print('title: %s, id: %s, mimeType: %s' % (file1['title'], file1['id'], file1["mimeType"]))

drive = login_google_drive()
list_file(drive)
Cache data to the work environment
def cache_data():
    # Replace the ids with the corresponding file ids obtained in the previous step
    health_txt = drive.CreateFile({'id': "117GkBtuuBP3wVjES0X0L4wVF5rp5Cewi"})
    tech_txt = drive.CreateFile({'id': "14sDl4520Tpo1MLPydjNBoq-QjqOKk9t6"})
    design_txt = drive.CreateFile({'id': "1J4lndcsjUb8_VfqPcfsDeOoB21bOLea3"})
    # The download here only caches the files; it will not create extra copies in your Google Drive directory
    health_txt.GetContentFile('health.txt', "text/plain")
    tech_txt.GetContentFile('tech.txt', "text/plain")
    design_txt.GetContentFile('design.txt', "text/plain")
    print("Cache successful")

cache_data()
Read data from the working environment
def load_data():
    titles = []
    print("Loading health category data...")
    with open("health.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("Loading tech category data...")
    with open("tech.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("Loading design category data...")
    with open("design.txt", "r") as f:
        for line in f.readlines():
            titles.append(line.strip())

    print("There are %s titles loaded." % len(titles))
    return titles

titles = load_data()
Load the labels
def load_label():
    arr0 = np.zeros(shape=[12000, ])
    arr1 = np.ones(shape=[12000, ])
    arr2 = np.array([2]).repeat(7318)
    target = np.hstack([arr0, arr1, arr2])
    print("There are %s tags loaded." % target.shape)

    encoder = LabelEncoder()
    encoder.fit(target)
    encoded_target = encoder.transform(target)
    dummy_target = krs.utils.np_utils.to_categorical(encoded_target)
    return dummy_target

target = load_label()
Text preprocessing
max_sequence_length = 30
embedding_size = 50
# Segment the titles with jieba
titles = [" ".join(jb.cut(t, cut_all=True)) for t in titles]
# word2vec word bagging
vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(max_sequence_length, min_frequency=1)
text_processed = np.array(list(vocab_processor.fit_transform(titles)))
# Read the word-to-index mapping
dict = vocab_processor.vocabulary_._mapping
sorted_vocab = sorted(dict.items(), key = lambda x : x[1])
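As a quick sanity check of the preprocessing (not in the original text, just a suggested sketch), you can look at the shape of the processed matrix and the first few vocabulary entries:
# Each title is now a fixed-length vector of max_sequence_length word indices
print(text_processed.shape)
print(text_processed[0])

# First few entries of the vocabulary sorted by index (index 0 is reserved)
print(sorted_vocab[:5])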
Building a neural network
Embedding and LSTM are used as the first two layers and the output is activated by Softmax
# Configure the network structure
def build_netword(num_vocabs):
    model = krs.Sequential()
    model.add(krs.layers.Embedding(num_vocabs, embedding_size, input_length=max_sequence_length))
    model.add(krs.layers.LSTM(32, dropout=0.2, recurrent_dropout=0.2))
    model.add(krs.layers.Dense(3))
    model.add(krs.layers.Activation("softmax"))
    model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
num_vocabs = len(dict.items())
model = build_netword(num_vocabs=num_vocabs)
import time
start = time.time()
# Training model
model.fit(text_processed, target, batch_size=512, epochs=10, )
finish = time.time()
print("Training time: %f seconds" %(finish-start))
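Since h5py is installed above, a natural follow-up (not part of the original walkthrough, and the file name below is just a placeholder) is to save the trained model and push it back to Google Drive with the same PyDrive client, so it survives the Colab session:
# Save the trained model to the Colab working directory (requires h5py)
model.save('lstm_title_classifier.h5')

# Upload the saved model to Google Drive
saved = drive.CreateFile({'title': 'lstm_title_classifier.h5'})
saved.SetContentFile('lstm_title_classifier.h5')
saved.Upload()
print('Model uploaded, id: {}'.format(saved.get('id')))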
Predicting samples
sen can be replaced with your own sentence. The prediction result is [probability of health, probability of technology, probability of design]; the article belongs to the category with the highest probability, but if the maximum probability is below 0.8 the article is judged as not classifiable.
sen = "Tips for good business Design."
sen_prosessed = "".join(jb.cut(sen, cut_all=True)) sen_prosessed = vocab_processor.transform([sen_prosessed]) sen_prosessed = np.array(list(sen_prosessed)) Result = model.predict(sen_prosessed) catalogue = list(result[0]).index(Max (result[0])) threshold=0.8if max(result[0]) > threshold:
if catalogue == 0:
print("This is an article about health.")
elif catalogue == 1:
print("This is an article about technology.")
elif catalogue == 2:
print("This is an article about design.")
else:
print("This article has no credible classification.")