(updated version from 28.02)

Supervision

There is no teaching assistant in this course, but the course team will give supervision, and it is possible to help each other on Slack.

Slack

Slack's own marketing says: Slack is a collaboration hub where you and your team can work together to get things done.

The key features (for MA8701) are organized conversations and a searchable history: you ask questions related to the MA8701 course/exercises, team members answer, everyone can see the conversation and learn, and the history is searchable.

Slack is used in many companies. If you have used Slack before, you know how this works, and if you have used Microsoft Teams, that is a competing product (so then you know the type of service).

  1. Join our ntnu-stats workspace by following the invitation link:

https://join.slack.com/t/ntnu-stats/shared_invite/enQtNTQ1ODMwMzQ0OTMzLTM3MGEyNDYyN2NkYmE3ZWYyMzBkOTJmZmVjODc1OWY3NTVkNjY3Zjc4NTZlZmM0NTNiMTBiMjQyZDI1NmFhZjk

We have set up our workspace to only allow members signing up with email addresses ending in @ntnu.no or @stud.ntnu.no, so you need to sign up with such an email.

  2. After you have signed up, you can by default take a tour of Slack, or skip it. Once you are in our Slack workspace you will see the other members (whom you may direct message instead of emailing) and can ask and answer questions in our #ma8701 channel.

Using Slack is voluntary - it is an offer - but we are rather sure that you will learn a lot if our workspace becomes an active venue for the course.

Questions about R and Python

General questions related to R: Mette.Langaas@ntnu.no, room 1236, Sentralbygg 2, Gløshaugen. General questions related to Python: Benjamin.Dunn@ntnu.no, room 1326, Sentralbygg 2, Gløshaugen.

Erlend will be available for supervision on data science and GPU usage in general when he is in Trondheim (he is based in Oslo).

Get familiar with the Avito Demand Prediction data set

Create a Kaggle account

https://www.kaggle.com/

and go to the Avito Demand Prediction Challenge and accept the rules (otherwise you cannot download the data):

https://www.kaggle.com/c/avito-demand-prediction/rules

Read about the aims of the challenge

https://www.kaggle.com/c/avito-demand-prediction

and make a note of the following:

  1. The aim is to predict demand for an online advertisement.
  2. The prediction is based on the following information on the advertisements:
  • the full description (title, description, images, etc.),
  • the context (geographically where the ad was posted, similar ads already posted) and the historical demand for similar ads in similar contexts,
  • where the full description includes:
    • the text part of the advertisement (in Russian),
    • the image part of the advertisement.
  3. The submission to the challenge was a file with all the item_ids of the test data and your deal_probability for these (see below), where the deal_probabilities were in the range [0,1].

  4. In the challenge, the root mean squared error (RMSE) on your test data predictions (deal_probability) was used as the score: https://www.kaggle.com/c/avito-demand-prediction#evaluation
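
For reference, the score is just the square root of the average squared difference between the predicted and true deal probabilities. A minimal sketch in Python (the function name is ours):

import numpy as np

def rmse(y_true, y_pred):
    # root mean squared error, the competition score
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# e.g. rmse([0.0, 0.5, 1.0], [0.1, 0.4, 0.9]) gives 0.1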

How to finalize the project?

To see how well your method performs we need to calculate the RMSE on test observations.

You have two choices: either 1) use the observations in the last 25% of the rows of the training data and label those as test data, or 2) make a submission at Kaggle with their test data.

For both of these solutions you should in addition set aside a validation set for checking how your model performs along the way (the amount of data is vast, so there should be no need for cross-validation).

1) Define test data yourself

The size of the training set (train.csv below) is 1503424 advertisements. You then use the observations in rows (1:1127568) for training and the observations in rows (1127569:1503424) for test.

Updated: you do not need to use all observations in the training set for training (you might take a smaller subset), and you do not need to use all of these observations as a test set; you may use a random subset if you do not want to spend time running large analyses.

Put away the test set and keep the deal_probability of these test cases “in the vault”. Do not look at these observations while you work on your model. Then, when all is good, apply your models/methods to the test cases and calculate the RMSE for the models/methods that you want to present. Report the RMSE in your final presentation.

Updated: the important thing here is that the same test set is used for all methods presented.

If you choose this option, you will not need to download any of the test files from kaggle (see below).
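
A minimal sketch of this option in Python with pandas (the row counts are from above; the subset sizes are arbitrary examples):

import pandas as pd

df = pd.read_csv('train.csv')

train_full = df.iloc[:1127568]   # rows 1:1127568, for model building
test = df.iloc[1127568:]         # rows 1127569:1503424, "in the vault"

# optionally work on smaller random subsets, with a separate validation set
shuffled = train_full.sample(frac=1.0, random_state=8701)
train = shuffled.iloc[:100000]
valid = shuffled.iloc[100000:200000]

Only touch test once, at the very end, when computing the RMSE you report.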

2) Make a submission at Kaggle

Make a submission at Kaggle from https://www.kaggle.com/c/avito-demand-prediction/submit

You need to make your group into a team first https://www.kaggle.com/c/avito-demand-prediction/team

and then submit your predictions - with your file in the required format - 508438 prediction rows. The file should have a header row. See the sample submission file below.

Report the score you get from Kaggle as part of your project presentation.
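
A minimal sketch of building such a file in Python (here preds is a placeholder for your 508438 predictions, aligned with the rows of test.csv; the predictions are clipped to [0,1] since the deal probability must lie there):

import numpy as np
import pandas as pd

test = pd.read_csv('test.csv', usecols=['item_id'])
preds = np.zeros(len(test))  # placeholder: replace with your model's predictions

submission = pd.DataFrame({'item_id': test['item_id'],
                           'deal_probability': np.clip(preds, 0.0, 1.0)})
submission.to_csv('submission.csv', index=False)  # writes the header row by default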

The data

https://www.kaggle.com/c/avito-demand-prediction/data

Files for download

There are several files for download - and each of them is very big (look at the file size before trying to download). The following is copied from the data description at Kaggle:

train.csv (zip 308MB, unzipped on mac 953MB) - Train data. [THIS one you surely need]

  • item_id - Ad id.
  • user_id - User id.
  • region - Ad region.
  • city - Ad city.
  • parent_category_name - Top level ad category as classified by Avito’s ad model.
  • category_name - Fine grain ad category as classified by Avito’s ad model.
  • param_1 - Optional parameter from Avito’s ad model.
  • param_2 - Optional parameter from Avito’s ad model.
  • param_3 - Optional parameter from Avito’s ad model.
  • title - Ad title.
  • description - Ad description.
  • price - Ad price.
  • item_seq_number - Ad sequential number for user.
  • activation_date - Date ad was placed.
  • user_type - User type.
  • image - Id code of image. Ties to a jpg file in train_jpg. Not every ad has an image.
  • image_top_1 - Avito’s classification code for the image.
  • deal_probability - The target variable. This is the likelihood that an ad actually sold something. It’s not possible to verify every transaction with certainty, so this column’s value can be any float from zero to one.

test.csv (zip 107MB, unzipped on mac 331MB) - Test data. [you do NOT need this unless you want to submit a deal_probability guess to Kaggle]

Same schema as the train data, minus deal_probability. 508438 item_ids.

train_active.csv (zip 3GB) - Supplemental data from ads that were displayed during the same period as train.csv. [these are not necessary to use]

Same schema as the train data minus deal_probability, image, and image_top_1.

test_active.csv (zip 2GB) - Supplemental data from ads that were displayed during the same period as test.csv. [these are not necessary to use]

Same schema as the train data minus deal_probability, image, and image_top_1.

periods_train.csv (zip 170MB) - Supplemental data showing the dates when the ads from train_active.csv were activated and when they were displayed. [these are not necessary to use]

  • item_id - Ad id. Maps to an id in train_active.csv. IDs may show up multiple times in this file if the ad was renewed.
  • activation_date - Date the ad was placed.
  • date_from - First day the ad was displayed.
  • date_to - Last day the ad was displayed.

periods_test.csv (zip 136MB) - Supplemental data showing the dates when the ads from test_active.csv were activated and when they were displayed. [these are not necessary to use]

Same schema as periods_train.csv, except that the item ids map to an ad in test_active.csv.

train_jpg.zip (49GB) - Images from the ads in train.csv. [wait with this, since that is discussed in Part 4 on Deep learning]

test_jpg.zip (19GB) - Images from the ads in test.csv. [you do NOT need this unless you want to submit a deal_probability guess to Kaggle]

sample_submission.csv (8MB) - A sample submission in the correct format. [you do NOT need this unless you want to submit a deal_probability guess to Kaggle]

train_jpg_{0, 1, 2, 3, 4}.zip (10GB each) [same as above]

These are the exact same images as you’ll find in train_jpg.zip but split into smaller zip archives so the data are easier to download. If you already have train_jpg.zip you do NOT need to download these. We have not made these zips available in kernels as they would only increase the kernel creation time.

Downloading data

If you want to download parts of the data you may either just download from the web page (Data) or install the official Kaggle API https://github.com/Kaggle/kaggle-api. The latter requires that you have Python 3 and pip installed, and that you use your Kaggle account. There is no API available for R.

Useful commands:

  • kaggle competitions files -c avito-demand-prediction
  • kaggle competitions download avito-demand-prediction -f train.csv.zip

If you don’t get this to work: have you confirmed that you have accepted the rules?
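
Once downloaded, there is no need to unzip first; pandas can, for example, read the zipped csv directly (a minimal sketch):

import pandas as pd

# pandas infers the compression from the .zip file extension
train = pd.read_csv('train.csv.zip', parse_dates=['activation_date'])
print(train.shape)  # expect (1503424, 18)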

You should not download the image part of the data, as those files are large (see file sizes above) and are already available at epic. See Erlend's presentation from 29.02.2019: https://github.com/Froskekongen/MA8701, where you see that the data are available on the epic cluster under /lustre1/projects/fs_ma8701_1/avito (only for those enrolled in the course who have also accepted the Avito challenge rules).

Getting inspired by exploring kernels

https://www.kaggle.com/c/avito-demand-prediction/kernels

Most kernels are in Python, but a few use R.

R

We found this kernel very useful (minus the xgboost part), both for exploratory data analysis and for preprocessing of the train.csv data: https://www.kaggle.com/ganeshn88/detailed-eda-nlp-xgboost-model

First step: Regularized regression

In the presentation we want you to show results from applying regularized regression to the data.

The important part here is the regularized regression, and you do not need to use the full data set - you may then be able to work on your laptop (and not yet use epic).

Some ideas on what to think about:

  • What should the response in the regression be? The raw deal_probability or a transformation thereof?
  • How should you preprocess categorical variables (see the kernels above for hints)?
  • What about the title and description? Should you use them? See below for a simple toy example using bag-of-words from the Part 1 lecture.

Text analysis is a whole subject in itself, and for R users the following teaching corner is a good read: https://kenbenoit.net/pdfs/text_analysis_in_R.pdf. The following can be read from that article on the preprocessing of text:

  1. Tokenization: the process of splitting text into tokens. These may be words, which are called unigrams.
  2. Normalization: lowercasing and stemming. A word might have different morphological variations (plural, verb conjugation), which is handled by only looking at the word stem.
  3. Removing stopwords: uninteresting words like “and” can be removed.
  4. Document term matrix (DTM): this is a common way of representing a text corpus (= a collection of texts). The DTM has one row for each document and one column for each term, and the cells indicate how often each term occurs in each document.
  5. Filtering and weighting: the DTM may be weighted with an estimate of the information value of each word, and a popular weighting scheme is term frequency-inverse document frequency (TF-IDF), which downweights terms that occur in many documents.

Bag-of-words approaches use the DTM, or its TF-IDF-weighted version, in the analyses.
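
As a toy illustration of steps 1-5 on a three-document corpus (a minimal sketch using scikit-learn, which does the tokenization, lowercasing and weighting inside its vectorizers; it does not stem, so that part of step 2 is skipped here):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The cat sat on the mat.",
        "The dog sat on the log.",
        "Dogs and cats can be friends."]

# steps 1-4: tokenize, lowercase, drop stopwords, build the DTM
cv = CountVectorizer(stop_words='english')
dtm = cv.fit_transform(docs)       # documents x terms count matrix
print(cv.get_feature_names_out())  # note: without stemming, 'dog' and 'dogs' stay distinct terms
print(dtm.toarray())

# step 5: TF-IDF weighting downweights terms that occur in many documents
tfidf = TfidfVectorizer(stop_words='english')
print(tfidf.fit_transform(docs).toarray().round(2))

The Python example below does the same with TfidfVectorizer directly on the Avito descriptions.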

Here is an example in Python (by Ben).

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

import pandas as pd, numpy as np
import re, string, random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LassoCV
from sklearn.utils import parallel_backend
from sklearn.model_selection import train_test_split


# T is the number of observations you want to run right now (the full dataset takes forever)
# mdf sets the min_df value in the TfidfVectorizer function -- when building the vocabulary
# for bag of words, ignore terms that have a document frequency strictly lower than min_df
T = 30000
mdf = 50

# get data from csv files
data = pd.read_csv('train.csv', usecols=['description', 'deal_probability'])
desc = (data['description'])
Y = (data['deal_probability'])
data = 0  # free memory; the full data frame is no longer needed

# break up data into train and test data 
traindesc, testdesc, trainY, testY = train_test_split(desc, Y, test_size=0.25, random_state=23)

# shrink training data to T
traindesc = traindesc[:T]
trainY = trainY[:T]

# Replace nans with spaces
traindesc.fillna(" ", inplace=True)
testdesc.fillna(" ", inplace=True)

## Get "bag of words" transformation of the data -- see example in Lasso book discussed in class 
## also: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
vec = TfidfVectorizer(ngram_range=(1,1), min_df=mdf, max_df=0.9, lowercase=True, strip_accents='unicode', sublinear_tf=True)
trainX = vec.fit_transform(traindesc)
testX = vec.transform(testdesc)

# fit lasso model
with parallel_backend('threading'):
  m = LassoCV(cv=5, verbose=True).fit(trainX, trainY)

# show results for fit data
plt.figure()
ax = plt.subplot(111)
plt.plot(m.alphas_, m.mse_path_, ':')
plt.plot(m.alphas_, m.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(m.alpha_, linestyle='--', color='k', label='CV estimate')
ax.set_xscale('log')
plt.legend()
plt.xlabel(r'$\lambda$')
plt.ylabel('MSE')
plt.axis('tight')
plt.savefig('lasso_path.png')

# show the terrible predictions
testYpred = m.predict(testX)
plt.figure()
plt.plot(testY, testYpred, '.', alpha=0.1)
plt.title('RMSE: %f'%np.sqrt(np.mean( (testYpred - testY)**2 )))
plt.savefig('lasso_prediction.png')

Here is a long example in R (by Mette - report errors to Mette.Langaas@ntnu.no).

library(tidyverse)
library(lubridate)
library(magrittr)
library(text2vec)
library(tokenizers)
library(stopwords)
library(Matrix)
library(stringr)
library(stringi)
library(forcats)
library(glmnet)
set.seed(0)

# assuming that train.csv is downloaded and is in ./input

Sys.setlocale(locale="ru_RU") # show Russian words

#---------------------------
# will not use test set, loading only training data

cat("Reading data...\n")
tr <- read_csv("./input/train.csv") 

#---------------------------
cat("Preprocessing...\n")

# here you may add other stuff than this (the commented-out ones are from the recommended kernel)

trpre <- tr %>% mutate(no_img = is.na(image) %>% as.integer(),
         no_dsc = is.na(description) %>% as.integer(),
         # no_p1 = is.na(param_1) %>% as.integer(), 
         # no_p2 = is.na(param_2) %>% as.integer(), 
         # no_p3 = is.na(param_3) %>% as.integer(),
         # titl_len = str_length(title),
         # desc_len = str_length(description),
         # titl_capE = str_count(title, "[A-Z]"),
         # titl_capR = str_count(title, "[А-Я]"),
         # desc_capE = str_count(description, "[A-Z]"),
         # desc_capR = str_count(description, "[А-Я]"),
         # titl_cap = str_count(title, "[A-ZА-Я]"),
         # desc_cap = str_count(description, "[A-ZА-Я]"),
         # titl_pun = str_count(title, "[[:punct:]]"),
         # desc_pun = str_count(description, "[[:punct:]]"),
         # titl_dig = str_count(title, "[[:digit:]]"),
         # desc_dig = str_count(description, "[[:digit:]]"),
         user_type = factor(user_type),
         category_name = factor(category_name) %>% as.integer(),
         parent_category_name = factor(parent_category_name) %>% as.integer(), 
         region = factor(region) %>% as.integer(),
         # param_1 = factor(param_1) %>% as.integer(),
         # param_2 = factor(param_2) %>% as.integer(),
         # param_3 = factor(param_3) %>% fct_lump(prop = 0.00005) %>% as.integer(),
         city =  factor(city) %>% fct_lump(prop = 0.0003) %>% as.integer(), #lumping together uncommon factors
         user_id = factor(user_id) %>% fct_lump(prop = 0.000025) %>% as.integer(),#lumping together userids not so common
         price = log1p(price), # log(price+1)
         txt = paste(title, description, sep = " "), # treating title and description together
         mday = mday(activation_date), #day of the month
         wday = wday(activation_date)) %>%  # day of the week
  select(user_id,region, city, parent_category_name,user_type,no_img,no_dsc,txt,mday,wday,deal_probability)
  # replace_na(list(image_top_1 = -1, price = -1, 
  #                 param_1 = -1, param_2 = -1, param_3 = -1, 
  #                 desc_len = 0, desc_cap = 0, desc_pun = 0, 
  #                 desc_dig = 0, desc_capE = 0, desc_capR = 0)) %T>% 
  glimpse(trpre)

rm(tr)
gc()

#---------------------------
# how to represent the txt using bag of words (from Part 1)
cat("Parsing text...\n")

it <- trpre %$%
  str_to_lower(txt) %>%
  str_replace_all("[^[:alpha:]]", " ") %>%
  str_replace_all("\\s+", " ") %>%
  tokenize_word_stems(language = "russian") %>% 
  itoken()

str(it)

# it is now an itoken iterator over the tokenized documents

vect <- create_vocabulary(it, ngram = c(1, 1), stopwords = stopwords("ru")) %>%
  prune_vocabulary(term_count_min = 3, doc_proportion_max = 0.4, vocab_term_max = 12500) %>% 
  vocab_vectorizer()

str(vect)

m_tfidf <- TfIdf$new(norm = "l2", sublinear_tf = T)
tfidf <-  create_dtm(it, vect) %>% 
  fit_transform(m_tfidf)

str(tfidf)
# tf = term frequency, idf = inverse document frequency
# The IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears + 1))
# tfidf = tf * idf
  
rm(it, vect, m_tfidf); gc()

#---------------------------
cat("Preparing data...\n")
# design matrix for the tfidf-part of the data

Xrest <- trpre %>% 
    select(-txt,-deal_probability) %>% 
    sparse.model.matrix(~ . - 1, .) 

idtest=1127569:1503424
idrest=1:1127568

# go for 1e5 training samples and 1e5 validation samples - and 1e5 fake test set
# the true test is kept in vault now, and not looked at for a while!

set.seed(8701)
randtrain=sample(idrest,1e5)
randvalid=sample(setdiff(idrest,randtrain),1e5)
randtest=sample(setdiff(idrest,union(randtrain,randvalid)),1e5)
# alternatively the same user should not be split?

Xtr=Xrest[randtrain,]
Ytr=trpre[randtrain,]$deal_probability

Xval=Xrest[randvalid,]
Yval=trpre[randvalid,]$deal_probability

Xtest=Xrest[randtest,]
Ytest=trpre[randtest,]$deal_probability

tfidftr=tfidf[randtrain,]
tfidfval=tfidf[randvalid,]
tfidftest=tfidf[randtest,]

#---------------------------
cat("Training model...\n")

fit=glmnet(x=Xtr,y=Ytr) # standardize=TRUE is the default; glmnet fits the intercept itself (hence no intercept column in Xrest)

# since we have all these data I want to use the validation set to choose the 
# optimal lambda, not cv.glmnet - therefore just loop over the lambdas
lambdas=fit$lambda
rmse=rep(NA,length.out=length(lambdas))
for (i in 1:length(lambdas))
{ 
  print(i)
  thislambda=lambdas[i]
  yhats=predict(fit,newx=Xval,type="response",s=thislambda)
  rmse[i]=sqrt(mean((Yval-yhats)^2))
}
plot(lambdas,rmse)
# OLS is the best with these predictors - so, this was a test just to check that 
# things are working before going on to the tfidf


fit=glmnet(x=tfidftr,y=Ytr,standardize = FALSE) # standardize=FALSE since the tf-idf entries are already weighted?
lambdas=fit$lambda
rmse=rep(NA,length.out=length(lambdas))
for (i in 1:length(lambdas))
{ 
  print(i)
  thislambda=lambdas[i]
  yhats=predict(fit,newx=tfidfval,type="response",s=thislambda)
  rmse[i]=sqrt(mean((Yval-yhats)^2))
}
plot(lambdas,rmse)
bestlambda=lambdas[which.min(rmse)]

yhattest=predict(fit,newx=tfidftest,s=bestlambda)
testrmse=sqrt(mean((Ytest-yhattest)^2))
testrmse # not really winning anything here, but, a good start :-)
plot(Ytest,yhattest,pch=20) # oops...

# lots to check out next, transforming the Y? and adding other covs?

More on glmnet: see the glmnet package vignette on CRAN.

Second step: Deep learning

In the presentation we want you to show results from applying deep learning to the data. This need not involve images, but it could.

This will happen in Part 4: Deep learning.

Third step: Methods of your choice

What do you want to try out in addition to the regularized regression and deep learning? You are not required to use more methods, but maybe you have been inspired by some of the kernels?

Final step: the oral presentation in weeks 14+15

The presentation will last for 10-20 minutes (depending on the group size), and all group members must present. The presentation should focus on the methods, and could include information on:

  • Which parts of the data have you used, and how were the data preprocessed (not in detail)?
  • Which statistical methods (from the reading list) have you used, and how? (Important!)
  • Did you also use other methods? Which methods, and how?
  • Any interesting observations?
  • Can you break down which parts of your solution contributed the most to reducing your test error?
  • Were there methods you thought would be very effective in reducing your test error, but that did not perform as expected?
  • How was your final result on the test set (RMSE)? (Compare the 3+ methods you have used on the same test set, which can be a subset of the test set.)