The road of research is long. Follow Xiao Zeng so you don't lose your way, and let's encourage each other and make progress together.

ALiPy is a Python toolkit for active learning developed by the Key Laboratory of Pattern Analysis and Machine Intelligence of the Ministry of Industry and Information Technology, School of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics.

ALiPy: Active Learning in Python

ALiPy provides a module-based implementation of the active learning framework that lets users conveniently evaluate, compare, and analyze the performance of active learning methods. It implements more than 20 algorithms and also lets users implement their own methods under different settings.

Features of ALiPy

  • Model independent: there is no limitation on the classification model. You can use an SVM from scikit-learn or a deep model from TensorFlow as needed.
  • Module independent: you can modify one or more modules of the toolkit without affecting the others.
  • Implement your own algorithm without inheriting anything: user-defined functions have few restrictions on things such as parameters or names.
  • Variant settings supported: noisy oracles, multi-label data, cost-sensitive querying, feature querying, and so on.
  • Powerful tools: saving and loading intermediate results, multithreading, and analysis of experimental results.

ALiPy modules

The active learning implementation is decomposed into components, and ALiPy is built from multiple modules, each corresponding to one component of the active learning process.

Module: basic function
  • alipy.data_manipulate: provides basic functions for data preprocessing and partitioning
  • alipy.query_strategy: contains 25 commonly used query strategies
  • alipy.index.IndexCollection: helps manage the indexes of labeled and unlabeled examples
  • alipy.metrics: provides multiple criteria for evaluating model performance
  • alipy.experiment.state and alipy.experiment.state_io: help save intermediate results after each query and recover the program from breakpoints
  • alipy.experiment.stopping_criteria: implements some commonly used stopping criteria
  • alipy.oracle: supports different oracle settings
  • alipy.experiment.experiment_analyser: provides gathering, processing, and visualization of experimental results
  • alipy.utils.multi_thread: provides a parallel implementation of k-fold experiments

These modules are designed and implemented independently, so the code in one part places no restrictions on the others. In addition, each module can be replaced by the user's own implementation, and each module is itself flexible enough for the toolkit to adapt to different settings.

[Figures: AL implementation framework for instance selection; AL implementation framework for noisy oracles; AL implementation framework for datasets with different costs; AL implementation framework for instance query]

Installing ALiPy

Requirements: Python >= 3.4, plus the base libraries numpy, scipy, scikit-learn, matplotlib, and prettytable. There are two main ways to install ALiPy: with pip, or by building from source.

Installing with pip (choose one of the three)

  • Install ALiPy from PyPI (recommended):
sudo pip install alipy
  • Install with pip into your home directory:
pip install --user alipy
  • Install the latest source from the GitHub repository with pip:
pip install git+https://github.com/NUAA-AL/alipy.git

Building from source

  • Clone ALiPy to a local directory, cd into the ALiPy folder, and run the install command:
cd ALiPy
sudo python setup.py install
  • Build and install from source into your home directory:
python setup.py install --user
  • Build and install from source for all users on Unix/Linux:
python setup.py build
sudo python setup.py install

Special settings in ALiPy

The most striking feature of ALiPy is its low coupling, which makes it easy to run experiments under other special settings.

Active learning setting: description
  • AL with Noisy Oracles: the oracle may sometimes return a wrong label
  • AL for Multi-Label Data: an instance is associated with multiple labels at the same time
  • AL with Different Costs: the cost of querying different labels can vary
  • AL by Querying Features: selects missing features of an instance to query
  • AL with Novel Query Types: queries other kinds of information about an instance than its label
  • AL for Large Scale Tasks: active learning on big data

Algorithms implemented in ALiPy

ALiPy provides more than 20 state-of-the-art algorithms for different active learning settings.

Code walkthrough

The walkthrough is divided into an introduction to ALiPy and an advanced guide.

Introduction to ALiPy

I'll show a simple example of customizing an active learning experiment with the tools in ALiPy, starting with a unified framework for active learning experiments and then introducing the corresponding tools in ALiPy.

Unified framework for active learning experiments

  1. Obtain a feature matrix X with shape [n_samples, n_features] and the corresponding label vector y with shape [n_samples] (if a concrete feature matrix is hard to obtain, you can operate on the instance indexes only), and split the data into training and test sets for the experiment. The random split should be repeated several times. In active learning, the training set is further split into an initial labeled set and an unlabeled pool to query from. Note that in most active learning setups the initial labeled set is small.
  2. Run the query process for each fold of the experiment and record its results. In each query iteration, a subset of the unlabeled data is queried and added to the labeled set. The model is then retrained on the updated labeled set and tested to evaluate the query.
  3. After all folds are finished, the learning curve of the query strategy is obtained by averaging the performance curves of the folds.
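Step 1 can be sketched on indexes alone. The following is a minimal standalone illustration in plain Python, not ALiPy's implementation: the function name `split_al` and its rounding rules are my own assumptions, and the parameter names simply mirror the ones used later in this article.

```python
import random


def split_al(n_samples, test_ratio=0.3, initial_label_rate=0.1, split_count=10, seed=0):
    """Return split_count random (train, test, labeled, unlabeled) index splits."""
    rng = random.Random(seed)
    splits = []
    for _ in range(split_count):
        indexes = list(range(n_samples))
        rng.shuffle(indexes)
        n_test = round(n_samples * test_ratio)
        test_idx, train_idx = indexes[:n_test], indexes[n_test:]
        # The initial labeled set is a small fraction of the training set.
        n_label = max(1, int(len(train_idx) * initial_label_rate))
        label_idx, unlabel_idx = train_idx[:n_label], train_idx[n_label:]
        splits.append((train_idx, test_idx, label_idx, unlabel_idx))
    return splits


splits = split_al(n_samples=150)
train_idx, test_idx, label_idx, unlabel_idx = splits[0]
print(len(train_idx), len(test_idx), len(label_idx), len(unlabel_idx))
```

Working on indexes rather than on copies of the data keeps every fold cheap to store and makes it easy to move an instance from the unlabeled pool to the labeled set.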

Modules in ALiPy

  • Use alipy.query_strategy to call traditional and state-of-the-art methods.

  • Use alipy.index.IndexCollection to manage the labeled and unlabeled indexes.

  • Use alipy.metrics to calculate your model's performance.

  • Use alipy.experiment.state and alipy.experiment.state_io to save the intermediate results after each query and to restore the program from a breakpoint.

  • Use alipy.experiment.stopping_criteria to get some example stopping criteria.

  • Use alipy.experiment.experiment_analyser to gather, process, and visualize your experimental results.

For experienced users, here is a complete example of an experiment implemented with ALiPy. After that, I'll explain the code piece by piece and introduce the common methods of the tools above.

import copy
from sklearn.datasets import load_iris
from alipy import ToolBox

X, y = load_iris(return_X_y=True)
alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')

# Split the data
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

# Use the default logistic regression classifier
model = alibox.get_default_model()

# The query budget for each fold is 55 queries
stopping_criterion = alibox.get_stopping_criterion('num_of_queries', 55)

# Use a predefined strategy
uncertainStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceUncertainty')
unc_result = []

for round in range(10):
    # Get the data split of this fold
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Get the saver for the intermediate results of this fold
    saver = alibox.get_stateio(round)

    # Set the initial performance point
    model.fit(X=X[label_ind.index, :], y=y[label_ind.index])
    pred = model.predict(X[test_idx, :])
    accuracy = alibox.calc_performance_metric(y_true=y[test_idx],
                                              y_pred=pred,
                                              performance_metric='accuracy_score')
    saver.set_initial_point(accuracy)

    while not stopping_criterion.is_stop():
        # Select a subset of the unlabeled pool to query
        select_ind = uncertainStrategy.select(label_ind, unlab_ind, model=model, batch_size=1)
        # Or pass your own probabilistic predictions:
        # prob_pred = model.predict_proba(X[unlab_ind.index])
        # select_ind = uncertainStrategy.select_by_prediction_mat(unlabel_index=unlab_ind, predict=prob_pred, batch_size=1)
        label_ind.update(select_ind)
        unlab_ind.difference_update(select_ind)

        # Update the model and compute its performance (adapt this to the model you use)
        model.fit(X=X[label_ind.index, :], y=y[label_ind.index])
        pred = model.predict(X[test_idx, :])
        accuracy = alibox.calc_performance_metric(y_true=y[test_idx],
                                                  y_pred=pred,
                                                  performance_metric='accuracy_score')

        # Save the intermediate result to file
        st = alibox.State(select_index=select_ind, performance=accuracy)
        saver.add_state(st)
        saver.save()

        # Pass the current progress to the stopping criterion object
        stopping_criterion.update_information(saver)
    # Reset the progress in the stopping criterion object
    stopping_criterion.reset()
    unc_result.append(copy.deepcopy(saver))

analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')
analyser.add_method(method_name='uncertainty', method_results=unc_result)
print(analyser)
analyser.plot_learning_curves(title='Example of AL', std_area=True)

The sections below explain each part in turn. First, create a ToolBox object and specify the query type of the experiment (here, querying all labels of an instance):

alibox = ToolBox(X=X, y=y, query_type='AllLabels')

Manage labeled and unlabeled indexes

alipy.index.IndexCollection is a list-like container used to manage your labeled and unlabeled indexes. An IndexCollection object can be created simply by passing a list or a numpy.ndarray.

a = [1, 2, 3]
a_ind = alibox.IndexCollection(a)
# Or create it by importing the class
from alipy.index import IndexCollection
a_ind = IndexCollection(a)

The common methods of IndexCollection are:

  • a_ind.index returns the indexes as a list, which can be used to index a matrix.

  • a_ind.update() adds a batch of indexes to the IndexCollection object.

  • a_ind.difference_update() removes a batch of indexes from the IndexCollection object.
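To show how these three methods fit together in one query iteration, here is a small stand-in container in plain Python. It is not ALiPy's IndexCollection: `SimpleIndexCollection` is a hypothetical class I wrote for illustration, and it only mimics the three members described above.

```python
class SimpleIndexCollection:
    """Hypothetical stand-in: an ordered, duplicate-free container of indexes."""

    def __init__(self, data=None):
        self._items = []
        self.update(data or [])

    @property
    def index(self):
        # A plain list copy, suitable for indexing a matrix or a list.
        return list(self._items)

    def update(self, other):
        # Add a batch of indexes, skipping any that are already present.
        for item in other:
            if item not in self._items:
                self._items.append(item)

    def difference_update(self, other):
        # Remove a batch of indexes; unknown indexes are ignored.
        removed = set(other)
        self._items = [item for item in self._items if item not in removed]

    def __len__(self):
        return len(self._items)


label_ind = SimpleIndexCollection([0, 1])
unlab_ind = SimpleIndexCollection([2, 3, 4, 5])
selected = [3, 5]                      # indexes chosen by a query strategy
label_ind.update(selected)             # queried instances become labeled
unlab_ind.difference_update(selected)  # and leave the unlabeled pool
print(label_ind.index, unlab_ind.index)
```

After each query the two containers stay disjoint and together still cover the training set, which is the invariant the main loop relies on.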

Split the data

There are two ways to split the data through the ToolBox object.

  1. You can split the data with alibox.split_AL() by specifying some options:

alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

This randomly splits the dataset into training, test, labeled, and unlabeled sets 10 times.

  2. You can use your own split function and pass the indexes train_idx, test_idx, label_idx, and unlabel_idx when initializing the ToolBox object:

train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y)
alibox = alipy.ToolBox(X=X, y=y, query_type='AllLabels', train_idx=train_idx, test_idx=test_idx, label_idx=label_idx, unlabel_idx=unlabel_idx)

Use a predefined strategy to select samples

The query strategy is arguably the core of an active learning algorithm. You can get a query strategy object from the alipy.ToolBox object simply by providing the strategy name:

uncertainStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceUncertainty')

Suppose you manage your indexes with alipy.index.IndexCollection, with the labeled index container named Lind and the unlabeled one named Uind. A predefined strategy can then be used like this (plain lists are also accepted):

select_ind = uncertainStrategy.select(label_index=Lind,
                                      unlabel_index=Uind,
                                      batch_size=1)
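To give an idea of what an uncertainty strategy does with the model's probabilistic predictions, here is a minimal standalone sketch of a least-confident rule in plain Python. It is my own illustration, not ALiPy's code: the function name `select_least_confident` is hypothetical, and the parameter names only echo the ones used in this article.

```python
def select_least_confident(unlabel_index, predict, batch_size=1):
    """Pick the batch_size unlabeled indexes whose top class probability is lowest."""
    if len(unlabel_index) != len(predict):
        raise ValueError("need one probability row per unlabeled index")
    # Confidence of a prediction = probability of its most likely class.
    scored = [(max(row), idx) for idx, row in zip(unlabel_index, predict)]
    scored.sort(key=lambda pair: pair[0])
    return [idx for _, idx in scored[:batch_size]]


Uind = [10, 11, 12, 13]
prob_pred = [
    [0.90, 0.05, 0.05],   # very confident
    [0.40, 0.35, 0.25],   # least confident
    [0.60, 0.30, 0.10],
    [0.50, 0.45, 0.05],
]
print(select_least_confident(Uind, prob_pred, batch_size=2))   # [11, 13]
```

The instances the model is least sure about are assumed to be the most informative ones to label, which is the intuition behind uncertainty sampling.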

Update and test the model

The available metric functions are 'accuracy_score', 'roc_auc_score', 'get_fps_tps_thresholds', 'hamming_loss', 'one_error', 'coverage_error', 'label_ranking_loss', and 'label_ranking_average_precision_score'. There are two ways to use them:

  1. Import the function from alipy.metrics and call it directly:
from alipy.metrics import accuracy_score
acc = accuracy_score(y_true=y, y_pred=model.predict(X))
  2. Call the calc_performance_metric() method of the ToolBox object:
acc = alibox.calc_performance_metric(y_true=y, y_pred=model.predict(X),
                                     performance_metric='accuracy_score')

Advanced guide

High-level encapsulation

ToolBox: initialize one object to get any tool

ToolBox, mentioned earlier, is a class that gives access to all the available tool classes. Through a ToolBox object you can get them without passing redundant parameters.

1. Initialize a ToolBox object

# query_type must be one of ['AllLabels', 'PartLabels', 'Features']
from sklearn.datasets import load_iris
from alipy import ToolBox

X, y = load_iris(return_X_y=True)
alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')

2. Get the default model

ALiPy provides a logistic regression model with default parameters, implemented by scikit-learn.

lr_model = alibox.get_default_model()
lr_model.fit(X, y)
pred = lr_model.predict(X)
# Get the probabilistic output
pred = lr_model.predict_proba(X)

3. Split the data

alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

4. Create an IndexCollection object

# alipy.index.IndexCollection is ALiPy's index management tool
a = [1, 2, 3]
a_ind = alibox.IndexCollection(a)

5. Get Oracle and Repository objects

The ToolBox class provides initialization of a clean oracle.

# To query by feature vector instead of by index, set query_by_example=True
clean_oracle = alibox.get_clean_oracle(query_by_example=False, cost_mat=None)
# A repository object is obtained with get_repository(round, instance_flag=False)
alibox.get_repository(round=0, instance_flag=False)

6. Get the State and StateIO objects

saver = alibox.get_stateio(round=1)
# To add a query to the StateIO object you need a State object. It is a dict-like
# container that holds the necessary information about one query (the state of the
# current iteration), such as cost, performance, and selected indexes.
st = alibox.State(select_index=select_ind, performance=accuracy, cost=cost, queried_label=queried_label)
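The idea of a dict-like state plus a saver that accumulates states and can be restored from a breakpoint can be sketched in plain Python as follows. This is only an illustration under my own assumptions: `SimpleState`, `SimpleSaver`, and the `dumps`/`loads` JSON round trip are hypothetical and say nothing about how ALiPy actually stores its files.

```python
import json


class SimpleState(dict):
    """Hypothetical dict-like record of one query iteration."""

    def __init__(self, select_index, performance, **extra):
        super().__init__(select_index=list(select_index), performance=performance, **extra)


class SimpleSaver:
    """Hypothetical container that accumulates the states of one fold."""

    def __init__(self, round_id):
        self.round_id = round_id
        self.initial_point = None
        self.states = []

    def set_initial_point(self, performance):
        self.initial_point = performance

    def add_state(self, state):
        self.states.append(state)

    def queried_indexes(self):
        return [i for st in self.states for i in st["select_index"]]

    def performance_curve(self):
        return [self.initial_point] + [st["performance"] for st in self.states]

    def dumps(self):
        # Serialize everything needed to resume the fold later.
        return json.dumps({"round_id": self.round_id,
                           "initial_point": self.initial_point,
                           "states": self.states})

    @classmethod
    def loads(cls, text):
        data = json.loads(text)
        saver = cls(data["round_id"])
        saver.initial_point = data["initial_point"]
        saver.states = data["states"]
        return saver


saver = SimpleSaver(round_id=0)
saver.set_initial_point(0.60)
saver.add_state(SimpleState(select_index=[7], performance=0.65, cost=1))
saver.add_state(SimpleState(select_index=[3], performance=0.72, cost=1))
restored = SimpleSaver.loads(saver.dumps())   # e.g. after an interruption
print(restored.queried_indexes(), restored.performance_curve())
```

Because every iteration is recorded, the labeled set at any point can be rebuilt from the initial split plus the queried indexes, which is what makes resuming from a breakpoint possible.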

7. Get a predefined QueryStrategy object

This was covered earlier, so only a brief example is given here.

QBCStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceQBC')

8. Calculate the performance

# Example of using the calc_performance_metric() method of the ToolBox object:
acc = alibox.calc_performance_metric(y_true=y, y_pred=model.predict(X), performance_metric='accuracy_score')

9. Get a stopping criterion

ALiPy implements some common stopping criteria:

  • No unlabeled samples are available (default)
  • The preset number of queries is reached
  • The preset cost limit is reached
  • The preset percentage of the unlabeled pool has been labeled
  • The preset running time (CPU time) is reached

# stop_criteria must be one of [None, 'num_of_queries', 'cost_limit', 'percent_of_unlabel', 'time_limit']
stopping_criterion = alibox.get_stopping_criterion(stop_criteria='num_of_queries', value=50)
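The life cycle of a stopping criterion (check, update after each query, reset between folds) can be sketched in plain Python like this. The class `QueryCountStop` is a hypothetical illustration of a query-count criterion, and unlike the ALiPy object used in the complete example, its `update_information` takes a plain count rather than a saver.

```python
class QueryCountStop:
    """Hypothetical query-count criterion: stop after `value` queries."""

    def __init__(self, value):
        self.value = value
        self.count = 0

    def is_stop(self):
        return self.count >= self.value

    def update_information(self, n_new_queries=1):
        self.count += n_new_queries

    def reset(self):
        self.count = 0


criterion = QueryCountStop(value=3)
queries_per_fold = []
for fold in range(2):
    done = 0
    while not criterion.is_stop():
        done += 1                      # one query iteration would run here
        criterion.update_information()
    queries_per_fold.append(done)
    criterion.reset()                  # without this, the next fold would run zero queries
print(queries_per_fold)                # [3, 3]
```

The reset at the end of each fold is easy to forget: one criterion object is shared by all folds, so its progress has to be cleared before the next fold starts.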

10. Get the experiment analyser

# alipy.experiment.experiment_analyser gathers, processes, and visualizes the results
analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')
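What the analyser conceptually does with the per-fold results (average the performance curves point by point and report the spread) can be sketched in plain Python. This is only my own illustration: `average_learning_curve` is a hypothetical helper, the numbers are made up for the example, and I assume all folds recorded the same number of points.

```python
from statistics import mean, pstdev


def average_learning_curve(fold_curves):
    """Return (means, stds) computed point by point across the folds."""
    n_points = len(fold_curves[0])
    if any(len(curve) != n_points for curve in fold_curves):
        raise ValueError("all folds must have the same number of points")
    columns = list(zip(*fold_curves))           # one tuple per query step
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]     # population standard deviation
    return means, stds


# Made-up accuracy curves of two folds: initial point plus two queries each.
curves = [
    [0.50, 0.60, 0.70],
    [0.70, 0.80, 0.90],
]
means, stds = average_learning_curve(curves)
for step, (m, s) in enumerate(zip(means, stds)):
    print(f"after {step} queries: {m:.2f} +/- {s:.2f}")
```

The standard deviation is what a shaded band around the mean curve would show when plotting the learning curve with a deviation area.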

11. Get the aceThreading object

# alipy.utils.aceThreading is a class that runs your k-fold experiments in parallel and prints the state of each thread
acethread = alibox.get_ace_threading()
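The general idea of running the folds in parallel threads can be sketched with Python's standard library. This is not how aceThreading is implemented; `run_fold` and `run_all_folds` are hypothetical names, and the "curve" returned by each fold is a placeholder for the real query loop.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fold(fold):
    """Placeholder for one fold's query loop; returns (fold, fake performance curve)."""
    curve = [round(0.5 + 0.1 * step + 0.01 * fold, 2) for step in range(3)]
    return fold, curve


def run_all_folds(n_folds, max_workers=4):
    """Run every fold in a thread pool and collect the results by fold number."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in fold order, regardless of completion order.
        return dict(pool.map(run_fold, range(n_folds)))


results = run_all_folds(n_folds=10)
print(len(results), results[0])
```

Folds are independent of each other, so they parallelize naturally. In Python, threads mainly help when the training code releases the global interpreter lock, as many numerical libraries do; otherwise processes may be the better choice.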

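12. Save and load ToolBox objects

# Save the experiment settings to file
alibox.save()
# ToolBox.load() is a class method; pass it the path of the saved .pkl file
alibox = ToolBox.load('./al_settings.pkl')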
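The `.pkl` extension suggests a pickle file. As a rough illustration of saving and restoring experiment settings that way, here is a standalone sketch using Python's `pickle` module. The helper names `save_settings` and `load_settings` and the layout of the settings dictionary are my own assumptions, not ALiPy's file format.

```python
import os
import pickle
import tempfile


def save_settings(settings, path):
    """Write the settings dictionary to a pickle file."""
    with open(path, "wb") as f:
        pickle.dump(settings, f)


def load_settings(path):
    """Read a settings dictionary back. Only load pickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical settings: the query type and the index splits of two folds.
settings = {
    "query_type": "AllLabels",
    "splits": [
        {"train": [0, 1, 2, 3], "test": [4, 5], "label": [0], "unlabel": [1, 2, 3]},
        {"train": [2, 3, 4, 5], "test": [0, 1], "label": [5], "unlabel": [2, 3, 4]},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "al_settings.pkl")
    save_settings(settings, path)
    restored = load_settings(path)
print(restored["query_type"], len(restored["splits"]))
```

Saving the splits matters because they are random: reloading them lets a later run, or a different query strategy, be compared on exactly the same data partitions.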

AlExperiment: run an AL algorithm in a few lines of code

ALiPy provides a class that encapsulates the various tools and directly implements the main loop of active learning: alipy.experiment.AlExperiment. Note: AlExperiment only supports the most common setting, querying all labels of an instance.

Code implementation

# Initialization. The model parameter accepts a classification model object that follows the scikit-learn API.
from sklearn.datasets import load_iris
from alipy.experiment.al_experiment import AlExperiment

X, y = load_iris(return_X_y=True)
al = AlExperiment(X, y, stopping_criteria='num_of_queries', stopping_value=50)

# Use the built-in function to generate a new split
al.split_AL()

# Classic and state-of-the-art query strategies are already implemented.
# The available strategy names include:
# ['QueryInstanceQBC', 'QueryInstanceUncertainty', 'QueryRandom', 'QureyExpectedErrorReduction',
#  'QueryInstanceGraphDensity', 'QueryInstanceQUIRE', 'QueryInstanceBMDR', 'QueryInstanceSPAL', 'QueryInstanceLAL']
# The GraphDensity and QUIRE methods require additional parameters.
al.set_query_strategy(strategy="QueryInstanceUncertainty", measure='least_confident')

# Set the performance metric. ALiPy implements many classic performance metrics:
# ['accuracy_score', 'roc_auc_score', 'get_fps_tps_thresholds', 'hamming_loss', 'one_error',
#  'coverage_error', 'label_ranking_loss', 'label_ranking_average_precision_score', 'zero_one_loss']
al.set_performance_metric('accuracy_score')

# By default, run k folds of active learning
al.start_query(multi_thread=True)

# Get the experimental results: a list of k StateIO objects, one per fold
al.get_experiment_result()
# You can also plot the learning curves of the k folds
al.plot_learning_curve(title=None)

Utility classes in ALiPy

If you are not familiar with a module or have questions about it, you can go directly to this address: parnec.nuaa.edu.cn/_upload/tpl… The usage of each module is introduced and analyzed there in detail.

If you found this article helpful, please follow, comment, and bookmark it. Thank you!

Please also follow Xiao Zeng so you don't lose your way. I'll keep recording, bit by bit, what I learn during my graduate studies. Let's encourage each other!

The paper has been uploaded: download.csdn.net/download/qq…
GitHub link: github.com/NUAA-AL/ali…
ALiPy website link: parnec.nuaa.edu.cn/_upload/tpl…

I also read Inji's article while preparing this post and learned a lot from it. If you are interested, have a look: blog.csdn.net/weixin_4457…