The Walmart Data Science Competition

Everyone wants to better understand their customers. But with vast quantities of data pouring in from newer sources such as social media and traditional ones such as transaction records, it is often difficult to separate the signal from the noise.

How can we extract meaning from so much information?

One way is to use machine learning, or predictive analytics. For example, Walmart uses machine learning to classify the different types of trips that people take to their stores. Customer classification can help Walmart improve store layout, better target promotions through apps, or analyze buying trends.

As a recruitment competition on Kaggle, Walmart challenged the data science community to recreate their trip classification system using only limited transactional data. This could help Walmart innovate and improve upon their machine learning processes.

Walmart provided over 600,000 rows of training data, meaning data already labeled with the corresponding trip classification. The challenge was to train models to predict the trip type for data without those labels, and Walmart scored participants on how well their predictions matched the hidden answers. My final model predicted the correct trip classification with 72% accuracy.

Here's a snapshot of the initial dataset:

import pandas as pd

data = pd.read_csv("train.csv")
print(data.head())

When competing in Kaggle competitions or doing any data science project for that matter, it is important to use a structured approach.

Feature preparation refers to the process of transforming raw data into data that machine learning models can read and learn from. You then run this data through your model, improve performance by tuning the model's parameters, and finally apply the trained model to test data without classifications. I repeated this cycle many times while optimizing my model's performance.
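To make that loop concrete, here is a toy, self-contained sketch; the random data and the scikit-learn RandomForest below are stand-ins for illustration, not the actual Walmart features or the final model used in this project.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)          # stand-in for prepared features
y = np.random.randint(0, 3, 1000)     # stand-in for trip-type labels

# Hold out part of the labeled data so the model can be scored honestly.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.4)

# Try a few parameter settings and score each one on the held-out data.
for n_trees in (50, 100, 200):
    model = RandomForestClassifier(n_estimators=n_trees).fit(X_train, y_train)
    print(n_trees, accuracy_score(y_val, model.predict(X_val)))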

Data Analysis - Uncovering the Mystery Behind Walmart's Trip Classification Strategy

Before beginning this machine learning process, I wanted to gain some basic understanding of Walmart's data, and specifically the Trip Types that we would need to predict. It is always easier to create great features when you have a story you want to tell your model.

A great tool for understanding many categories of data at a single glance is a heat map.

import matplotlib.pyplot as plt
import seaborn

# mytestgrouped_categories_norm: a table of normalized department purchase
# counts per TripType, prepared earlier (not shown).
a4_dims = (13, 9)
fig, ax = plt.subplots(figsize=a4_dims)
seaborn.heatmap(ax=ax, data=mytestgrouped_categories_norm.T, cmap=plt.cm.Blues,
                linecolor='lightgrey', linewidths=.001)
ax.xaxis.tick_top()
plt.xticks(rotation=90)
plt.rc('xtick', labelsize=10)
plt.title('TripType', y=1.04)

You can see that many Trip Types correlate strongly with one or two very popular departments.

I also created bar graphs for each trip type to explore these associations in more detail.

data = pd.read_csv("train.csv")
type_6 = data[data.TripType == 6]
type_6_items = type_6[["TripType","DepartmentDescription"]]
type_6_items.DepartmentDescription.value_counts().head().plot(
    kind="bar", rot=45, title="Type 6 Trips", color="midnightblue")

I discovered that the most important aspect of Walmart's classification was the type of items purchased, with many trip types signifying a specific purpose, from pet food to party supplies to men's clothes.

I could even informally label many of the 38 trip types for my own entertainment.

Feature Preparation

Creating useful data features for a model to learn from is perhaps the most important part of the data science process. Often, domain expertise can help this process. Based on my knowledge of the data, I created the following features, or variables, for each store visit:

Numerical Weekday Variable

data['Weekday'] = data['Weekday'].map({"Monday": 1, "Tuesday": 2, "Wednesday": 3,
                                       "Thursday": 4, "Friday": 5, "Saturday": 6,
                                       "Sunday": 7})

74 Department Description Variables

Then, I wanted to describe the purchased items in a numerical way. The best approach was to create a column for each of the Department Description types, with the value being the number of items purchased in that category per trip. Creating columns for every category of data in a column like 'DepartmentDescription' is called creating dummy variables for those values:

dummies = pd.get_dummies(data.DepartmentDescription)
data[dummies.columns] = dummies

By default, dummy variables will contain boolean values (0 or 1, with 1 meaning the instance is true), but I instead wanted the values to represent the number of items purchased, so I multiplied the result by the Scan Count:

data[dummies.columns] = data[dummies.columns].apply(lambda x: x * data["ScanCount"])

Return Variable

I then created a new feature called Return, signifying whether an item was returned, with 1 = a return and 0 = no return.

data.loc[data.ScanCount < 0, 'Return'] = 1
data.loc[data.Return != 1, 'Return'] = 0

Grouping Data

Machine learning models are typically structured to make predictions or classifications on individual rows in a table. This presented a problem: each row in the Walmart data set represented a single item, not a complete store visit. As often comes up in data science, I needed to summarize many rows of item data into a single row that would encapsulate the meaning of a particular trip. I used pandas' groupby function, grouping some columns by their max value and others by summing the rows:
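One note before grouping: the aggregation below also sums a NumItems column that does not appear in the snippets above. Presumably it was derived from the per-row scan count, along the lines of this reconstruction:

# Hypothetical reconstruction: treat each row's scan count as its item count,
# so that summing NumItems per visit gives the total items in that trip.
data['NumItems'] = data['ScanCount']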

grouped_data = data.groupby("VisitNumber")
grouped_data = grouped_data.agg({'Weekday': np.max, "TripType": np.max, 'NumItems': np.sum, 'Return': np.max, 
          '1-HR PHOTO': np.sum, 'ACCESSORIES': np.sum, '(All Other Department Descriptions)': np.sum...})

The data was now summarized into one row per store visit.

Category Counts

My next variable described the total number of unique departments purchased from on each trip. I thought this would be another important variable, signifying whether this was a multi-purpose or single-purpose trip.

def add_category_counts(data):
    # Count how many department columns (from position 4 onward) are positive
    # for each visit, i.e. how many distinct departments the trip touched.
    alist = []
    for array in np.asarray(data.iloc[:, 4:]):
        count = 0
        for item in array:
            if item > 0:
                count += 1
        alist.append(count)
    cat_counts = pd.Series(alist, index=data.index, name="CategoryCounts")
    data.insert(4, 'CategoryCounts', cat_counts)
    return data
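The same count can also be computed without explicit Python loops; a vectorized equivalent (assuming, as in the function above, that the department columns begin at position 4) would be:

# Count how many department columns are positive for each visit, then insert
# the result at the same position used by the loop-based version above.
category_counts = (grouped_data.iloc[:, 4:] > 0).sum(axis=1)
grouped_data.insert(4, 'CategoryCounts', category_counts)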

Fineline Number Dummy Variables

Lastly, I created dummy variables for only the most frequent Fineline Numbers to limit the size of the dataframe. I later used all of the Fineline Numbers by creating a sparse matrix, a data structure from the SciPy library that minimizes the memory required for a matrix made up mostly of zero values; the full set was simply too large to hold in a dense pandas dataframe as before. In all, I had a dataframe of around 90,000 rows by 5,000 columns.
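No snippet survives for this step, so here is a rough sketch of the idea; the cutoff of 100 Fineline Numbers and the variable names are illustrative, and the sparse conversion uses scipy.sparse rather than a pandas dataframe:

import pandas as pd
from scipy import sparse

# Dummy variables for only the most frequent FinelineNumber values
# (the cutoff of 100 is illustrative, not the value actually used).
top_finelines = data['FinelineNumber'].value_counts().head(100).index
frequent = data['FinelineNumber'].where(data['FinelineNumber'].isin(top_finelines))
fineline_dummies = pd.get_dummies(frequent, prefix='FL')

# For the full set of Fineline Numbers, store the mostly-zero dummy matrix as a
# SciPy sparse matrix, which both scikit-learn and XGBoost accept as input.
all_fineline_dummies = pd.get_dummies(data['FinelineNumber'], prefix='FL')
fineline_sparse = sparse.csr_matrix(all_fineline_dummies.values)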

Modeling - Applying Machine Learning Techniques

I tried many different machine learning models throughout the process. I went through logistic regression, Naive Bayes, Random Forest, Extra Trees, and others before landing on the XGBoost library, which produced superior results. XGBoost builds an ensemble of decision trees through gradient boosting, adding trees one at a time so that each new tree corrects the errors of those before it, and halting at the best fit. To see the code for the other models, please refer to the project's Github page.

import xgboost as xgb
from sklearn.cross_validation import train_test_split
from sklearn.metrics import log_loss

Before modeling, it is important to split your training data into a training set and a test set, withholding the test set's labels from the model. This way, it is easy to see how the model performs before unleashing it on brand new data without answers.

mytrain, mytest = train_test_split(data, test_size = .4)

dtrain = xgb.DMatrix(np.asarray(mytrain[features]), label = np.asarray(mytrain.TripType))
dtest = xgb.DMatrix(np.asarray(mytest[features]), label = np.asarray(mytest.TripType))

Setting Parameters

The XGBoost library provides many customizable parameters to optimize the model for specific circumstances. For instance, it is possible to either output a single prediction for each visit or the probability of every trip type for that particular visit. Since Walmart was using a logloss score as its scoring metric, it was necessary to output the probability of each trip type.

num_round = 200
param = {'objective': 'multi:softprob', 'num_class':38, 
     'eval_metric': 'mlogloss', "max_delta_step": 5}
watchlist = [(dtrain,'train'), (dtest, 'eval')]

Training the Model

XGBoost helps prevent overfitting with its early_stopping_rounds parameter, which halts training once the evaluation score has not improved for a given number of rounds. Training a model can take anywhere from a few seconds to multiple hours depending on the complexity of the ensemble and the size of the data. These runs usually took a couple of minutes.

bst = xgb.train(param, dtrain, num_round, watchlist, 
            early_stopping_rounds=3)

In this case, the model trained for 90 iterations before stopping, with the best score coming at the 87th iteration. Its log loss was 0.847, which translated to an accuracy of 72% when taking the most probable Trip Type from each set of predicted probabilities.
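As a rough sketch of that last conversion (and assuming, as the num_class parameter requires, that the TripType labels in dtrain and dtest were encoded as integers 0 through 37):

import numpy as np

# Each prediction row holds 38 probabilities; take the most probable class for
# each visit and compare it to the held-out labels to get an accuracy figure.
val_probs = bst.predict(dtest)
predicted = np.argmax(val_probs, axis=1)
accuracy = np.mean(predicted == dtest.get_label())
print(accuracy)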

Predicting Upon Test Data and Formatting Data to CSV for Competition Submission

# Predict trip-type probabilities for the unlabeled competition data with the trained model.
dsubmit = xgb.DMatrix(np.asarray(test_data[test_features]))
test_predictions = bst.predict(dsubmit)

def predictions_to_csv(test_predictions):
    # One probability column per trip type, with the VisitNumber index added
    # back as the first column of the submission file.
    test_predictions = pd.DataFrame(test_predictions)
    test_indexes = test_data.index
    test_predictions.insert(0, 'VisitNumber', test_indexes)
    return test_predictions.to_csv("submissions/fifth_fineline_xgb.csv", index=False)

Analyzing Results

As you can see below, machine learning models do much better predicting frequent trip types, and struggle with classifications that are more rare. This is a well-known problem that can really only be addressed by gathering more data.
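A quick way to see this on the held-out validation set, reusing the bst and dtest objects from the training step above, is to compute per-class accuracy:

import numpy as np
import pandas as pd

# Per-class accuracy on the held-out set: the frequent trip types score well,
# while the rarer ones pull the overall accuracy down.
val_probs = bst.predict(dtest)
predicted = np.argmax(val_probs, axis=1)
actual = dtest.get_label()

per_class = (pd.DataFrame({'actual': actual, 'correct': predicted == actual})
             .groupby('actual')['correct']
             .agg(['mean', 'count'])
             .sort_values('count', ascending=False))
print(per_class)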

I hope this has helped you better understand the machine learning process and, if you are interested, encourages you to compete in a Kaggle data science competition. You can see the current active competitions at kaggle.com!