progress

I'd rather be anything but ordinary


GSoC 2020 has almost come to an end. This post summarizes what I did before GSoC's final evaluation. There are three things that I want to introduce:

  • A Composite class to combine multiple machine learning algorithms
  • Suggesting a class name when the requested class name is not found
  • CrossValidation wrapper

GSoC week 2

We want to refactor the Machine base class to be stateless, but not all algorithms can be stateless. Non-parametric machines need the labels and features to be stored in the class. So it is convenient to divide the whole machine family into non-parametric and parametric machines. The basic idea is to add a new interface called NonParametric, then change every class that requires labels and features (such as knn and gp) to be a subclass of NonParametric. Meanwhile, we also want the labels and features to be set only when train is called.
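The split described above could be sketched roughly like this. It is only a toy illustration, not Shogun's real code: the `Machine` and `NonParametricMachine` classes here are heavily simplified, `ToyNearestNeighbor` and `apply_one` are hypothetical names, and `Features` and `Labels` are assumed to be plain `std::vector<double>`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for Shogun's Features and Labels.
using Features = std::vector<double>;
using Labels = std::vector<double>;

// Base class: holds no training data at all.
class Machine
{
public:
    virtual ~Machine() = default;
    virtual void train(const Features& x, const Labels& y) = 0;
    virtual double apply_one(double x) const = 0;
};

// Non-parametric machines need the training data at prediction time,
// so the state lives here and is only set when train() is called.
class NonParametricMachine : public Machine
{
public:
    void train(const Features& x, const Labels& y) override
    {
        assert(x.size() == y.size() && !x.empty());
        m_features = x;
        m_labels = y;
    }

protected:
    Features m_features;
    Labels m_labels;
};

// Toy 1-nearest-neighbour on scalar features.
class ToyNearestNeighbor : public NonParametricMachine
{
public:
    double apply_one(double x) const override
    {
        assert(!m_features.empty());
        std::size_t best = 0;
        for (std::size_t i = 1; i < m_features.size(); ++i)
        {
            if (std::abs(m_features[i] - x) < std::abs(m_features[best] - x))
                best = i;
        }
        return m_labels[best];
    }
};
```

A parametric machine would derive from `Machine` directly and keep only its learned parameters, never the training data.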

#5055 Add NonParametricMachine class

#5053 Refactor NearestCentroid class

This is my first blog post about GSoC

Currently, the shogun Machine base class has two main class member functions: train and apply. The current usage is:

shared_ptr<Features> train_data;
shared_ptr<Labels> train_labels;
auto clf = create<LeastSquaresRegression>();
clf->set_labels(train_labels);
clf->set_features(train_data);
clf->train();

The Machine class is stateful, which makes it unclear which features and labels a model was actually trained on. Features and Labels should not be stored in the object, so I started by refactoring this class to make it stateless.
The usage of the new API should look like this:

auto test_labels = create<LeastSquaresRegression>()
    ->fit(train_features, train_labels)
    ->predict(test_data);
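To make the idea concrete, here is a minimal sketch of a chainable, stateless-style fit/predict interface. It is not Shogun's implementation: `ToyLeastSquares` is a hypothetical class that fits y = w * x without an intercept, and `Features` and `Labels` are assumed to be plain `std::vector<double>`. The point is that `fit` keeps only the learned weight, not the training data.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-ins for Shogun's Features and Labels.
using Features = std::vector<double>;
using Labels = std::vector<double>;

// Toy regression model y = w * x (no intercept).
// It must be owned by a std::shared_ptr for shared_from_this() to work.
class ToyLeastSquares : public std::enable_shared_from_this<ToyLeastSquares>
{
public:
    // Reads the training data, stores only the learned weight,
    // and returns the model itself so calls can be chained.
    std::shared_ptr<ToyLeastSquares> fit(const Features& x, const Labels& y)
    {
        assert(x.size() == y.size() && !x.empty());
        double xy = 0.0;
        double xx = 0.0;
        for (std::size_t i = 0; i < x.size(); ++i)
        {
            xy += x[i] * y[i];
            xx += x[i] * x[i];
        }
        assert(xx != 0.0);
        m_w = xy / xx;
        return shared_from_this();
    }

    // Uses only the learned weight; no training data is needed here.
    Labels predict(const Features& x) const
    {
        Labels out;
        out.reserve(x.size());
        for (double v : x)
            out.push_back(m_w * v);
        return out;
    }

private:
    double m_w = 0.0;
};
```

With this sketch, `std::make_shared<ToyLeastSquares>()->fit(train_x, train_y)->predict(test_x)` mirrors the chained style shown above.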

First, we need to refactor all classes that are derived from Machine. The first step is to find all the lines of code where the class accesses m_labels.

Shogun has a large code base, so it is very hard to find by hand all the lines of code that make Machine stateful with respect to m_labels. So my mentors suggested that I use LibTooling to find all these instances. LibTooling is a really powerful library, with a steep learning curve, for building standalone tools on top of Clang's C++ parser. After about one week, I finally finished writing this tool.

The main matcher is memberExpr(member(hasName("m_labels"))).bind("func"), which gives us the AST node of each matching MemberExpr. However, just printing the source line of every MemberExpr is not very readable. We should add more context, such as the class name and the method name. The Clang AST has built-in functionality that helped me find the method and class names. We can use getParents to walk up to the enclosing CXXMethodDecl, and once we have the CXXMethodDecl, it is easy to use getParent to get the CXXRecordDecl. But the story does not end here: we only want the CXXRecordDecls that are derived from Machine, so we have to check that the bases of each CXXRecordDecl we find include Machine.

auto record = method->getParent();
bool is_derived_from_machine = false;

// Collect every direct and indirect base class of the record.
llvm::SmallPtrSet<const CXXRecordDecl *, 4> Bases;
auto Collect = [&Bases](const CXXRecordDecl *Base) {
    Bases.insert(Base);
    return true; // keep visiting the remaining bases
};
record->forallBases(Collect);

// Keep the match only if one of the bases is named Machine.
for (auto &&base : Bases)
{
    if (base->getNameAsString() == "Machine")
    {
        is_derived_from_machine = true;
        break;
    }
}


Thanks:

gf712 helped me fix a lot of grammar errors in this post.

A C++ variadic template is a template with at least one parameter pack. A template parameter pack is a template parameter that accepts zero or more template arguments (non-types, types, or templates).
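As a small illustration of the definition, the sketch below uses a type parameter pack in two ways: recursive expansion and `sizeof...`. The function names `sum` and `count_args` are made up for this example, and `sum` is assumed to be called with `int` arguments only.

```cpp
#include <cstddef>

// Base case: an empty pack sums to zero.
inline int sum()
{
    return 0;
}

// Rest is a template parameter pack: it accepts zero or more types.
// The pack is peeled off one argument at a time by recursion.
template <typename T, typename... Rest>
int sum(T first, Rest... rest)
{
    return first + sum(rest...);
}

// sizeof... yields the number of arguments in the pack.
template <typename... Ts>
constexpr std::size_t count_args(Ts...)
{
    return sizeof...(Ts);
}
```

Both functions also accept an empty pack, so `sum()` and `count_args()` are valid calls.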
