GSoC First Evaluations

This blog summarizes what I have done before GSoC’s first evaluation. I will talk about three cool things:

  • refactoring all Machine classes to not have labels/features as part of their state
  • finding all classes that need to be refactored with LibTooling
  • writing a LabelEncoder to allow users to pass arbitrary discrete values as classification labels.

Refactoring all Machine classes

Currently, the two most important member functions in the shogun Machine class are train and apply. The current usage is:

shared_ptr<Features> train_data;
shared_ptr<Labels> train_labels;
auto reg = create<Machine>("LeastSquaresRegression");
// the machine stores the labels and features as state before training
reg->set_labels(train_labels);
reg->set_features(train_data);
reg->train();
// ...and apply then implicitly uses whatever state was set
auto predictions = reg->apply(test_data);

The Machine class stores features and labels, i.e. it is stateful, which causes confusion about which features and labels are used for training.
Features and labels should not be stored in the object; instead, we should use the features and labels that are passed to train and apply. So I started by refactoring this class to make it stateless.
The new API looks like this:

auto test_labels = create<Machine>("LeastSquaresRegression")
                       ->fit(train_features, train_labels)
                       ->predict(test_data);
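
For the chained call above to work, fit has to return the machine itself. A minimal sketch of the idea (the signatures and the use of enable_shared_from_this are my assumptions, not the final Shogun API):

class Machine : public std::enable_shared_from_this<Machine>
{
public:
    std::shared_ptr<Machine> fit(
        const std::shared_ptr<Features>& features,
        const std::shared_ptr<Labels>& labels)
    {
        // train on the data passed in rather than on stored state
        train_machine(features, labels);
        // hand back the machine itself so that predict can be chained
        return shared_from_this();
    }
    ...
};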

We want to refactor the Machine base class to be stateless, but not all algorithms can be stateless. Non-parametric machines (like KNN, kernel ridge regression, and kernel SVMs) need labels and features to be stored in the class: KNN, for example, keeps the training features and labels so that the stored data can be compared against new data at apply time. So it is convenient to divide the whole Machine family into non-parametric and parametric machines.

The basic idea is to add a new interface called NonParametricMachine, then change all classes (such as KNN, GP) that require the training labels/features when being applied to new data to become subclasses of NonParametricMachine. The features and labels are stored in the NonParametricMachine class, and when all the refactoring is done, they will be removed from the Machine class (see the sketch after the steps below). Meanwhile, we also want to set labels and features only when train is called. The typical non-parametric refactor looks like this:

  1. Remove all constructor parameters that involve labels/features.
    -KNN::KNN(int32_t k, const std::shared_ptr<Distance>& d, const std::shared_ptr<Labels>& trainlab, KNN_SOLVER knn_solver)
    +KNN::KNN(int32_t k, const std::shared_ptr<Distance>& d, KNN_SOLVER knn_solver)
  2. Remove the default parameter NULL from apply and train.
    -virtual std::shared_ptr<MulticlassLabels> train_machine(std::shared_ptr<Features> data=NULL);
    +virtual std::shared_ptr<MulticlassLabels> train_machine(std::shared_ptr<Features> data);

    bool KNN::train_machine(std::shared_ptr<Features> data)
    {
    -   if (data)
    -   {
            require(
                m_labels->get_num_labels() == data->get_num_vectors(),
                "Number of training vectors ({}) does not match number of labels "
                "({})",
                data->get_num_vectors(), m_labels->get_num_labels());
    -       distance->init(data, data);
    -   }
    +   m_features = data;
    +   distance->init(data, data);
        ...
    }
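
To make the split concrete, here is a sketch of the intended hierarchy (the member layout is my illustration of the plan above, not the final Shogun code):

class NonParametricMachine : public Machine
{
protected:
    // non-parametric machines keep the training data around, because
    // applying them to new data (e.g. KNN) needs the stored
    // features/labels; once all subclasses are migrated, these
    // members disappear from Machine itself
    std::shared_ptr<Features> m_features;
    std::shared_ptr<Labels> m_labels;
};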

Finding all classes that need to be refactored with LibTooling

We want to refactor all classes that are derived from Machine. The first step is to find all the lines of code where such a class accesses m_labels.

Shogun has a large code base, so it is very hard to find by hand all the lines of code that make Machine stateful with respect to m_labels. So my mentors suggested that I use LibTooling to find all these instances. LibTooling is a really powerful Clang library for building standalone tools on top of clang's C++ parser, and it comes with a steep learning curve. After about one week, I finally wrote this script.

The main matcher is memberExpr(member(hasName("m_labels"))).bind("func"), which gives us the AST node of each MemberExpr. However, just printing the line of every MemberExpr is not readable; we should add more context, such as the class name and method name. The Clang AST has builtin functionality that helped me find these: we can use getParents to walk up to the enclosing CXXMethodDecl, and once we have the CXXMethodDecl, it is easy to use getParent to get the CXXRecordDecl. But the story does not end here: we only want the CXXRecordDecls that are derived from Machine, so we have to restrict the found CXXRecordDecls to those deriving from Machine.
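
Wiring this up might look roughly like the following (the callback class name and the exact upward walk are my illustration, not the actual script):

#include "clang/ASTMatchers/ASTMatchFinder.h"
#include "clang/ASTMatchers/ASTMatchers.h"

using namespace clang;
using namespace clang::ast_matchers;

class LabelsUsagePrinter : public MatchFinder::MatchCallback
{
public:
    void run(const MatchFinder::MatchResult &Result) override
    {
        const auto *member = Result.Nodes.getNodeAs<MemberExpr>("func");
        if (!member)
            return;
        // walk up the parent chain until we reach the enclosing method
        auto &ctx = *Result.Context;
        auto parents = ctx.getParents(*member);
        const CXXMethodDecl *method = nullptr;
        while (!parents.empty() && !method)
        {
            method = parents[0].get<CXXMethodDecl>();
            parents = ctx.getParents(parents[0]);
        }
        if (!method)
            return;
        // ... print the class name, method name and source location
    }
};

// registered via something like:
//   MatchFinder finder;
//   finder.addMatcher(memberExpr(member(hasName("m_labels"))).bind("func"), &printer);

Once we have the enclosing CXXMethodDecl, the base-class check looks like this: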

auto record = method->getParent();
bool is_derived_from_machine = false;
llvm::SmallPtrSet<const CXXRecordDecl *, 4> Bases;
auto Collect = [&Bases](const CXXRecordDecl *Base) {
    Bases.insert(Base);
    return true;
};
record->forallBases(Collect);
for (auto &&base : Bases)
{
    if (base->getNameAsString() == "Machine")
    {
        is_derived_from_machine = true;
        break;
    }
}
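
As an aside, the same restriction could probably be pushed into the matcher itself, something along the lines of (untested):

memberExpr(member(hasName("m_labels")),
           hasAncestor(cxxRecordDecl(isDerivedFrom("Machine"))))
    .bind("func")

which would let the callback skip the manual forallBases walk.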

LabelEncoder

Currently, there are many Labels implemented in shogun; BinaryLabels and MulticlassLabels are the most commonly used. BinaryLabels contains two valid values, {-1, +1}, and the valid values of MulticlassLabels are {0, ..., nr_classes-1} (contiguous!). However, the current implementation in shogun does not ensure the values in Labels are valid, so for algorithms which require valid values, a mapping from the original values to these continuous values is needed. Therefore, I designed the LabelEncoder base class and the respective BinaryLabelEncoder and MulticlassLabelsEncoder derived classes to solve this problem.

The main idea is simple: maintain a mapping from the original values to the continuous values, and create the respective inverse mapping. When train is called, the mapping and the inverse mapping are stored. The mapping is used to transform the input labels to continuous values, and the inverse operation is performed in apply.
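
A rough sketch of the mechanism, using plain std types rather than Shogun's actual Labels/SGVector (this is my illustration of the idea, not the committed implementation; BinaryLabelEncoder would emit {-1, +1} instead of {0, 1}):

#include <map>
#include <set>
#include <vector>

class LabelEncoderSketch
{
public:
    // learn the mapping from original values to contiguous codes 0..n-1,
    // returning the unique original values
    std::vector<double> fit(const std::vector<double>& labels)
    {
        std::set<double> unique(labels.begin(), labels.end());
        double code = 0;
        for (double v : unique)
        {
            mapping[v] = code;
            inverse_mapping[code] = v;
            ++code;
        }
        return {unique.begin(), unique.end()};
    }

    // original values -> continuous codes
    std::vector<double> transform(const std::vector<double>& labels) const
    {
        std::vector<double> out;
        for (double v : labels)
            out.push_back(mapping.at(v));
        return out;
    }

    // continuous codes -> original values
    std::vector<double> inverse_transform(const std::vector<double>& codes) const
    {
        std::vector<double> out;
        for (double c : codes)
            out.push_back(inverse_mapping.at(c));
        return out;
    }

private:
    std::map<double, double> mapping, inverse_mapping;
};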

The interface of LabelEncoder mainly follows sklearn's LabelEncoder; the basic usage looks like this:

auto label_encoder = std::make_shared<BinaryLabelEncoder>();
SGVector<int32_t> vec{-100, 200, -100, 200, -100};
auto origin_labels = std::make_shared<BinaryLabels>(vec);
auto unique_vec = label_encoder->fit(origin_labels); // unique_vec = {-100, 200}, the mapping is stored
auto continuous_labels = label_encoder->transform(origin_labels); // continuous_labels = {-1, 1, -1, 1, -1}

auto recovered_labels = label_encoder->inverse_transform(continuous_labels); // continuous_labels are transformed back to {-100, 200, -100, 200, -100}