Introduction to Learning to Rank

Shunya Vichaar

Ranking models are widely used in search and personalisation systems. If you’ve ever wondered how those search results or product recommendations get magically sorted to suit your tastes, you’re in for a treat! Today, we’re delving into the fascinating world of learning to rank, where algorithms compete to be the top dog in the realm of personalisation. However, despite their popularity, I couldn’t find an exhaustive guide when I was implementing ranking models myself, so I decided to share my learnings in a straightforward manner.

But wait, here’s the catch, my friend, Higher rank doesn’t mean control in the end! It’s just a burden, it takes its toll, Not a privilege, but a responsibility, on the whole! :P

In this article, I’ll guide you through techniques for building ranking models, along with tips on how to create and run your own. Let’s dive in and unravel this fascinating world of rankings!


How is ranking different from traditional ML problems?

Starting with a naive approach, we can consider ranking as a classification problem where we label our items based on clicks or bookings. For example, if an item is clicked, we label it as 1, while non-clicked items can be labeled as 0 (binary classification). Similarly, we can include another label for booked items (multi-class classification). With this approach, we can train a classification model to classify our set of items into relevant/non-relevant items and rank them accordingly in our final list. However, the problem with this approach is that the position of the training instance in the item-set isn’t considered in a ranked list (i.e., we do not have context of the entire list of items, rather we consider a single item at a time), so irrelevant items could be given more importance.
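As a toy illustration, a pointwise baseline might look like the sketch below (the column names and data here are made up purely for illustration):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical impression log: one row per item shown for some query
items = pd.DataFrame({
    'price':   [10, 25, 8, 40, 15, 30],
    'rating':  [4.2, 3.9, 4.8, 4.0, 3.5, 4.6],
    'clicked': [1, 0, 1, 0, 0, 1],   # pointwise label: clicked or not
})

# each row is scored in isolation -- the model never sees the surrounding list
clf = LogisticRegression().fit(items[['price', 'rating']], items['clicked'])
items['score'] = clf.predict_proba(items[['price', 'rating']])[:, 1]

# "ranking" is then just sorting the items of a query by the pointwise score
print(items.sort_values('score', ascending=False))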

Pairwise Ranking

Pairwise ranking is like a game of comparing separate pairs of items in a list and picking the one that’s higher ranked, or choosing both if they’re equally ranked. It’s like choosing between fries and salad, or having both if you can’t decide! Pairwise approaches are the cool kids in town because they work better in practice than point-wise and list-wise approaches. After all, predicting the relative order of items is closer to the true nature of ranking, rather than just predicting classes like binary or multi-class classification. It’s like trying to rank cat videos — it’s all about the relative cuteness factor!

Intuition for pairwise loss — Consider normal binary cross entropy loss defined as
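C = −[ y · log(p) + (1 − y) · log(1 − p) ]

where y is the true label of the item and p is the predicted probability.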

Now, instead of taking the probability/label of a single item, we can inject some context of the list by modelling the probability that one item should be ranked above the other, based on the difference in their scores.

You can compute this probability using the expression below, where we apply a sigmoid to the difference in scores between two items of a ranked list.
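P_ij = sigmoid(s_i − s_j) = 1 / (1 + e^(−(s_i − s_j)))

Here s_i and s_j are the model’s scores for items i and j, and P_ij is the predicted probability that item i should be ranked above item j.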

The above-mentioned approach can be considered an ‘aha’ moment; it’s the intuition behind the RankNet algorithm. You might be thinking that by simply replacing the loss function in our neural network implementation, we can easily convert a pointwise model into a pairwise model. But hold your horses! You also need to consider the fact that initially we were only taking n items into account (assuming n is the length of your list), so you only had to predict n scores. But now, with pairwise ranking, you have to compute the loss for each and every pair, which in the worst case blows up to roughly n² pairs. Talk about increasing workload! There can be multiple optimisations on top of it, like the (n−1) adjacent pairs that can cover the entire list, but beware, implementing that can bring a whole new set of complications (try to figure that out!).
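To make this concrete, here is a minimal sketch (plain NumPy, nothing framework-specific assumed) of the pairwise cross-entropy loss for a single pair of items:

import numpy as np

def pairwise_loss(s_i, s_j, p_bar_ij):
    """RankNet-style cross entropy on the sigmoid of the score difference.

    s_i, s_j : model scores for items i and j
    p_bar_ij : target probability that i should be ranked above j
               (1.0 if i is more relevant, 0.0 if j is, 0.5 if tied)
    """
    p_ij = 1.0 / (1.0 + np.exp(-(s_i - s_j)))  # sigmoid of the score difference
    return -(p_bar_ij * np.log(p_ij) + (1 - p_bar_ij) * np.log(1 - p_ij))

# i is truly more relevant than j: the correct ordering gives a small loss,
# while the flipped ordering gives a large one
print(pairwise_loss(2.0, 0.5, 1.0))
print(pairwise_loss(0.5, 2.0, 1.0))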

To deal with the above-mentioned complexities, enter LambdaRank, the solution to all those pesky problems! The intuition behind this genius algorithm is that instead of deriving the parameter updates (for the weights in our neural network) by back-propagating the pairwise loss, we can directly define the gradients (λ) for each pair and then update the parameters once after accumulating them over all the pairs. This innovative approach turbocharges model training time, allowing for lightning-fast updates and supercharged performance. Say goodbye to waiting around for those gradients and hello to accelerated model training with LambdaRank!

Lambda (derivative of cost function w.r.t. score) can be defined directly as -
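λ_ij = −σ / (1 + e^(σ · (s_i − s_j))) · |ΔNDCG_ij|

Here the pair (i, j) is taken such that item i is more relevant than item j, σ controls the shape of the sigmoid, and ΔNDCG_ij is the change in NDCG obtained by swapping the positions of items i and j in the current ranking. The lambda for a single item is then obtained by summing λ_ij (with the appropriate sign) over all the pairs it participates in.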

Here, NDCG refers to Normalised Discounted Cumulative Gain, which can be calculated using the expression below —
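DCG@p = Σ (i = 1 to p) rel_i / log2(i + 1)

NDCG@p = DCG@p / IDCG@p

where IDCG@p is the DCG of the ideal ordering (items sorted by their true relevance).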

NDCG is a measure of ranking quality. In the numerator, we take rel(i) as the label/relevance of a particular item and discount it by log2(i + 1), where i is the position of the item in the list. We do this for all the items in the list (assuming the length of the list is ‘p’) and sum the results (thus, cumulative). Finally, we calculate an ideal discounted cumulative gain (IDCG), where all relevant items are ranked highest, and normalise our results using that, creating the ultimate benchmark for ranking awesomeness.

Simply speaking, this ranking metric returns a high value if the true labels are ranked high by the predictions (think of NDCG as the ranking cheerleader: it cheers the loudest when the truly relevant items end up at the top).

If a value for p is given to the metric, it will only consider the p highest scores in the ranking.
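As a quick illustration with scikit-learn’s ndcg_score (one readily available implementation), the k argument plays the role of p:

import numpy as np
from sklearn.metrics import ndcg_score

# one query with four items: graded relevance labels and predicted scores
true_relevance  = np.asarray([[3, 2, 0, 1]])
predicted_score = np.asarray([[0.9, 0.7, 0.8, 0.1]])

print(ndcg_score(true_relevance, predicted_score))        # NDCG over the full list
print(ndcg_score(true_relevance, predicted_score, k=2))   # NDCG@2: only the top 2 positions count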

LambdaMART

LambdaMART is a combination of LambdaRank and MART (Multiple Additive Regression Trees). MART uses gradient-boosted decision trees for prediction tasks; LambdaMART improves on this by using gradient-boosted decision trees with a cost function derived from LambdaRank. It has shown better results than other ranking models on the original experimental ranking datasets.
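You don’t have to implement this from scratch: gradient-boosting libraries ship LambdaMART-style rankers out of the box. For instance, a minimal sketch with LightGBM (which exposes it through the lambdarank objective; the hyper-parameter values here are just placeholders) could look like this:

import lightgbm as lgb

# group sizes (number of items per query) are passed at fit time,
# exactly like the qids arrays built in the split function later in this article
ranker = lgb.LGBMRanker(
    objective='lambdarank',   # LambdaMART-style objective
    metric='ndcg',
    n_estimators=500,
    learning_rate=0.05,
)
# ranker.fit(X_train, y_train, group=qids_train,
#            eval_set=[(X_test, y_test)], eval_group=[qids_test], eval_at=[10])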

Alright, folks, after getting an overview of the learning-to-rank framework, it’s time to implement a simple pairwise solution.

Assume we have a sample dataset with a group_id (which typically corresponds to a search_id/query_id, since we’re ranking items for a specific query), the original item positions (to calculate final metrics and serve as a benchmark), a set of item features, and labels indicating whether an item was shown as an impression, clicked, or booked (any dataset of this shape can be used for exploration).

Code —

import numpy as np
import pandas as pd
import xgboost

df = pd.read_csv('data/sample_df.csv')
df.head()


def train_test_split(df, group_name, label_name, test_percentage=0.2, random=True):
    """
    Parameters
    ----------
    df: pandas.DataFrame, dataframe containing your modelling dataset
    group_name: string, name of the column that is unique for every list of items
    label_name: string, target variable
    test_percentage: float, optional
        The fraction of groups to place in the test set.
    random: boolean, optional
        Whether to shuffle the groups before splitting.

    Returns
    -------
    X_train (dataframe), y_train (series), qids_train (array),
    X_test (dataframe), y_test (series), qids_test (array),
    training (list), testing (list)
    """

    unique_groups = list(df[group_name].unique())

    # if we want randomness in the split
    if random:
        np.random.shuffle(unique_groups)

    # split on groups (not rows), so all items of a query stay in the same split
    cutoff = int(len(unique_groups) * (1 - test_percentage))
    training = unique_groups[:cutoff]
    testing = unique_groups[cutoff:]

    df_train = df[df[group_name].isin(training)]
    df_test = df[df[group_name].isin(testing)]

    # Creating a numpy array which contains the size of each train group
    # (assumes df is sorted by group_name, so the counts line up with the rows)
    qids_train = df_train.groupby(group_name)[group_name].count().to_numpy()
    # Keeping only the features on which we would train our model
    X_train = df_train.drop([group_name, label_name], axis=1)
    # Relevance label for train
    y_train = df_train[label_name].astype(int)

    # Creating a numpy array which contains the size of each test group
    qids_test = df_test.groupby(group_name)[group_name].count().to_numpy()
    # Keeping only the features on which we would test our model
    X_test = df_test.drop([group_name, label_name], axis=1)
    # Relevance label for test
    y_test = df_test[label_name].astype(int)

    return X_train, y_train, qids_train, X_test, y_test, qids_test, training, testing


X_train, y_train, qids_train, X_test, y_test, qids_test, training_ids, testing_ids = train_test_split(
    df, 'group_id', 'label')

The code above splits your dataset at the group level, so that all items belonging to the same query end up in the same split.

# list of feature columns that we will take while training
# (provide explicit column names in your case)
imp_feats = [str(i) for i in range(1, 41)]

# provide hyper-parameters as per your use case
# (the values below are illustrative placeholders, not tuned ones)
model = xgboost.XGBRanker(
    objective='rank:pairwise',   # pairwise ranking objective
    booster='gbtree',
    eta=0.1,
    gamma=0.0,
    alpha=0.0,
    reg_lambda=1.0,
    max_depth=6,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    n_estimators=500,
    random_state=42,
)

# note: in newer xgboost versions, eval_metric and early_stopping_rounds
# are passed to the XGBRanker constructor instead of fit()
model.fit(X_train[imp_feats], y_train,
          eval_set=[(X_train[imp_feats], y_train),
                    (X_test[imp_feats], y_test)],
          group=qids_train,
          eval_group=[qids_train, qids_test],
          eval_metric=["ndcg@10"],  # track ndcg@10 as the north star while training
          early_stopping_rounds=100,
          verbose=True)

And voila! You can now put your ranking model to the test by calculating NDCG on the newer scores versus the original item positions. Feel free to extend this to calculate metrics for your use case.
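For instance, a minimal sketch of such an evaluation (assuming the index of X_test still lines up with df, as in the split function above) could look like this:

from sklearn.metrics import ndcg_score

# score the held-out items with the trained ranker
eval_df = X_test.copy()
eval_df['group_id'] = df.loc[X_test.index, 'group_id']
eval_df['label'] = y_test
eval_df['score'] = model.predict(X_test[imp_feats])

# per-query NDCG@10 of the model's scores against the relevance labels
ndcgs = [
    ndcg_score([grp['label'].to_numpy()], [grp['score'].to_numpy()], k=10)
    for _, grp in eval_df.groupby('group_id')
    if len(grp) > 1  # NDCG is undefined for single-item lists
]
print('mean NDCG@10:', sum(ndcgs) / len(ndcgs))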

That’s all for this quick introduction to the world of rankers. Hopefully, these insights will help you embark on your journey of exploring ranking algorithms with confidence and humor!

Don’t hesitate to ping me for help. Happy ranking!
