Custom sklearn grid search for an arbitrary number of parameters

Posted by J.J. Young on Apr 25, 2017

In machine learning, there is one question that cannot be avoided: tuning the parameters. There are numerous ways of tuning parameters, and most of them fall into a few categories, e.g. gradient-based methods, sampling methods, and variational methods. Other approaches, such as heuristic optimization or dynamic programming, are excluded here. All of these can be viewed as optimization-oriented methods. One might think that sampling methods only do things like estimating the density of a distribution, but sampling actually goes far beyond that: it has a close relation with EM and has physical interpretations such as annealing methods. Since hyperparameters are the main focus of this article, we will discuss sampling in a separate blog post.

Hyperparameters - a Bird's-Eye View

When talking about hyperparameters, there is a full literature of research and study on them. A hyperparameter itself can be viewed as a sample from some distribution over a space, and it influences the shape of the decision boundary. I have encountered many people stating that hyperparameters cannot be optimized, since otherwise you would need hyper-hyperparameters. But look at how human minds work: a person looking at a tree can realize that he is looking at a tree; furthermore, he can realize that he has realized it, and not just in two stages but in an infinite loop. This is a kind of recursive structure and self-reference. So why can't we have an algorithm that learns from the world and optimizes its intrinsic hyperparameters in a recursive, self-referential manner? That is beyond the scope of this blog. Here we focus on a very early-stage way of automatically tuning hyperparameters; the answer is naive and easy: just grid search over the entire space.
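As a warm-up, this is what plain grid search looks like in sklearn; the estimator, parameter grid, and dataset below are only illustrative choices, not anything specific to this post.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# exhaustively try every combination in the grid with 5-fold cross-validation
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)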

Grid Search with Sklearn

For data scientists using Python, sklearn is one of the most popular tools for machine learning tasks, and the grid search method is already implemented in it. There is a nice blog post that shows how to implement a customized estimator in sklearn and how to automatically tune its parameters with grid search. That article is very helpful for building your own customized estimator that can use the grid search function provided by sklearn. But in some cases I have encountered, the parameters I want to tune come from a json file, and this json file holds a lot of information about your business functionality. So can we parse the json into parameters in a generic way and then use grid search? The gap is that the parameters you want to tune must already be declared in the __init__ method of your customized estimator, with exactly the same names as the __init__ arguments and as the attributes used by the estimator. For contrast, a minimal conventional estimator is sketched first; the method used to bridge the gap follows.
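The conventional pattern (as in the blog post referenced above) hard-codes each tunable parameter by name in __init__ and stores it under the same attribute name. A minimal sketch, with made-up parameter names:

from sklearn.base import BaseEstimator, ClassifierMixin

class FixedInitClassifier(BaseEstimator, ClassifierMixin):
    # every tunable parameter must be spelled out here by name,
    # which is exactly the restriction we want to remove
    def __init__(self, threshold=0, window=5):
        self.threshold = threshold
        self.window = window

The dynamic version below generates this __init__ from the json instead.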

import inspect

from sklearn.base import BaseEstimator, ClassifierMixin


def fit(self, X, y=None):
    # prefer the instance attribute set by the generated __init__ (and by
    # grid search via set_params), falling back to the shared status dict
    self.threshold_ = getattr(self, 'threshold', self.status.get('threshold', 0))
    return self


def _meaning(self, x):
    # binary decision against the fitted threshold
    return x >= self.threshold_


def predict(self, X, y=None):
    if not hasattr(self, "threshold_"):
        raise RuntimeError("You must train the classifier before predicting data!")
    return [self._meaning(x) for x in X]


def score(self, X, y=None):
    # toy score: the count of positive predictions
    return sum(self.predict(X))


# Template for the generated __init__: every json-derived argument is stored
# on the instance under its own name, which is what sklearn's get_params /
# set_params (and hence GridSearchCV) rely on.
function_template = """
def {function_name}(self, {arguments}):
    _, _, _, values = inspect.getargvalues(inspect.currentframe())
    values.pop("self")
    for arg, val in values.items():
        setattr(self, arg, val)
    self.conf_parser = MLConfParser()
""".strip()


def create_constructor(config_parser, func_name='init_constructor'):
    # exec into an explicit namespace (inside a function, exec cannot
    # reliably create new locals in Python 3), then pull the function out
    namespace = {'inspect': inspect, 'MLConfParser': MLConfParser}
    exec(function_template.format(function_name=func_name,
                                  arguments=config_parser.get_params_str()),
         namespace)
    return namespace[func_name]


# Build the estimator class at runtime with type(); its __init__ carries
# whatever parameter names the json happens to declare.
GenericInitClassifier = type("GenericInitClassifier", (BaseEstimator, ClassifierMixin), {
    "status": {},
    "fit": fit,
    "_meaning": _meaning,
    "__init__": create_constructor(MLConfParser()),
    "predict": predict,
    "score": score
})

As you can see from the above, the most important difference between this method and the one in the referenced blog is that it uses type() and exec() to generate the estimator class. There are two reasons. First, the business-related json is translated into __init__ parameters by MLConfParser, which handles converting the json into init parameters and vice versa. Second, type() is used to keep the scope of exec() as small as possible; exec() seems unavoidable if you want to customize __init__ from arbitrary json. In this sense, we have created an estimator with an arbitrary number of parameters that can still use the grid search provided by sklearn.
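MLConfParser itself is not shown in this post. Purely as a sketch of the shape it could take, assuming the business json is a flat mapping from parameter names to default values (the file name and contents are illustrative):

import json

class MLConfParser(object):
    def __init__(self, conf_path='conf.json'):
        # conf_path and the flat-json layout are assumptions, not from the post
        with open(conf_path) as f:
            self.defaults = json.load(f)  # e.g. {"threshold": 0, "window": 5}

    def get_params_str(self):
        # "threshold=0, window=5" -- spliced into the generated __init__ signature
        return ', '.join('{}={!r}'.format(k, v)
                         for k, v in sorted(self.defaults.items()))

    def get_norm_conf(self, kv_conf):
        # translate flat instance attributes back into the business json
        return {k: kv_conf[k] for k in self.defaults if k in kv_conf}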

Last Few Words

At last we can almost finish our story. To actually put the above to use, we create a real class that inherits from GenericInitClassifier and implements methods such as fit and predict. FeatureNormalizer is a class that wraps transforming the parameters back into the business-related json; a guess at its shape is sketched first, followed by the full class.
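FeatureNormalizer is likewise not shown in the post. Since it is used via pandas' .pipe below, a reasonable guess is a DataFrame-to-DataFrame callable configured by the business json; everything in this sketch is an assumption:

class FeatureNormalizer(object):
    def __init__(self, norm_conf):
        # norm_conf is the business json rebuilt by MLConfParser.get_norm_conf
        self.norm_conf = norm_conf

    def __call__(self, df):
        # illustrative behavior only: scale any configured column by its factor
        for col, factor in self.norm_conf.items():
            if col in df.columns:
                df[col] = df[col] / float(factor)
        return df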

from sklearn.metrics import f1_score


class YourClassifier(GenericInitClassifier):
    normalizer = None
    prediction = None
    train_data = None
    predict_data = None
    conf_parser = None

    def parse_kv_conf(self, kv_conf=None):
        # rebuild the business json from the instance attributes that
        # grid search may have overwritten
        return self.conf_parser.get_norm_conf(kv_conf or self.__dict__)

    def fit(self, X, y=None):
        self.normalizer = FeatureNormalizer(self.parse_kv_conf())
        self.train_data = X
        return self

    def predict(self, X):
        status = {}

        # predict_data is assumed to be populated elsewhere (not shown in
        # the post); run it through: normalize -> score -> index by id
        self.prediction = (
            self.predict_data.pipe(self.normalizer)
            .pipe(PredictByScore(), status=status)
            .pipe(lambda df: df[['id', 'prediction']].set_index('id'))
        )

        return self.prediction.loc[X['id']]['prediction']

    def score(self, X, y=None):
        return f1_score(y, self.predict(X))
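Finally, a sketch of how this wires into grid search; the parameter name 'threshold' and the training frames are placeholders for whatever your json actually declares:

from sklearn.model_selection import GridSearchCV

clf = YourClassifier()  # __init__ was generated from the json by MLConfParser
param_grid = {'threshold': [0.1, 0.5, 0.9]}  # keys must match the json-derived names

search = GridSearchCV(clf, param_grid, cv=3)
search.fit(train_df, train_labels)  # train_df / train_labels: your own data
print(search.best_params_)

Because the generated __init__ stores every json key as an attribute under the same name, set_params works out of the box and GridSearchCV can sweep any of them.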