(updated version from 28.02)
There is no teaching assistant in this course, but the course team will give supervision, and you can help each other on Slack.
The commercial says: Slack is a collaboration hub where you and your team can work together to get things done.
Key features (for MA8701) are organized conversations and searchable history: you ask questions related to the MA8701 course/exercise, and team members answer - all can see the conversation and learn - and the history is searchable.
Slack is used in many companies. If you have used Slack before, you know how this works; if you have used Microsoft Teams, that is a competitor, so you know the type of service.
We have set up our workspace to allow members to sign up with email addresses ending in @ntnu.no or @stud.ntnu.no, so you need to sign in with such an email address.
Using Slack is voluntary - it is an offer - and we are rather sure that you will learn a lot if our workspace becomes an active venue for the course.
General questions related to R: Mette.Langaas@ntnu.no in 1236, sentralbygg 2, Gløshaugen. General questions related to Python: Benjamin.Dunn@ntnu.no in 1326, sentralbygg 2, Gløshaugen.
Erlend will be available for supervision on data science and GPU usage in general when he is in Trondheim (he is based in Oslo).
and go to the Avito Demand prediction Challenge and accept the rules (or else no download of data)
https://www.kaggle.com/c/avito-demand-prediction
and make a note of the fact that
the submission to the challenge was a file with all the item_ids of the test data and your deal_probability for these (see below), where the deal_probabilities were in the range [0,1].
In the challenge, the root mean squared error (RMSE) of your test data predictions (deal_probability) was used as the score: https://www.kaggle.com/c/avito-demand-prediction#evaluation
To see how well your method performs we need to calculate the RMSE on test observations.
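As a minimal sketch of the score, the RMSE is the square root of the average squared difference between the true and the predicted deal_probability. The function name `rmse` and the toy numbers below are made up for illustration and are not part of the challenge material.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy deal_probability values (made up for illustration).
y_true = [0.0, 0.5, 1.0, 0.25]
y_pred = [0.1, 0.4, 0.8, 0.25]
score = rmse(y_true, y_pred)
```

Lower is better; a perfect prediction gives 0.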
You have two choices: either 1) use the observations in the last 25% of the rows of the training data and label them as test data, or 2) make a submission at Kaggle with their test data.
For both of these solutions you should in addition set aside a validation set for checking how your model performs along the way (the amount of data is vast, so there should be no need for cross-validation).
The size of the training set (train.csv below) is 1503424 advertisements. You then use the observations in rows 1:1127568 for training and the observations in rows 1127569:1503424 for testing.
Updated: you do not need to use all observations in the training set for training (you might take a smaller subset), and you do not need to use all of the test observations as a test set; you may use a random subset if you do not want to spend time running large analyses.
Put away the test set and keep the deal_probability of these test cases “in the vault”. Do not look at these observations when you work on your model. Then, when all is good, apply your models/methods to the test cases and calculate the RMSE for the models/methods that you want to present. Report the RMSE in your final presentation.
Updated: the important thing here is that the same test set is used for all methods presented.
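One possible way to set this up is sketched below: the first 75% of the rows form the training pool, the last 25% form the test pool, and fixed random subsets are drawn with a seed so that every method is evaluated on the same rows. The seed and the subset size of 100000 are assumptions chosen for illustration, and the indices are 0-based (Python), so R rows 1:1127568 correspond to 0..1127567.

```python
import random

N_ROWS = 1503424                 # rows in train.csv, as stated above
N_TRAIN = (3 * N_ROWS) // 4      # first 75% of the rows -> 1127568

# 0-based index pools for training/validation and for the held-out test rows.
train_pool = range(0, N_TRAIN)
test_pool = range(N_TRAIN, N_ROWS)

rng = random.Random(8701)        # fixed seed: the same subsets on every run
SUBSET = 100_000                 # hypothetical subset size, chosen for speed

# Draw training and validation rows without overlap from the training pool.
picked = rng.sample(train_pool, 2 * SUBSET)
train_idx = sorted(picked[:SUBSET])
valid_idx = sorted(picked[SUBSET:])

# One fixed random subset of the held-out rows, reused for all methods.
test_idx = sorted(rng.sample(test_pool, SUBSET))
```

Saving `test_idx` to a file is a simple way to make sure that all group members use the same test rows.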
If you choose this option, you will not need to download any of the test files from kaggle (see below).
Make a submission at Kaggle from https://www.kaggle.com/c/avito-demand-prediction/submit
You need to make your group into a team first https://www.kaggle.com/c/avito-demand-prediction/team
and then submit your predictions - with the required format of your file - 508438 prediction rows. This file should have a header row. See the sample submission file above.
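A minimal sketch of writing such a file is given below. It assumes that the header is `item_id,deal_probability` (check this against the sample submission file) and clips predictions to [0,1], since a linear model can predict values outside that range. The function name `write_submission` and the toy ids are hypothetical.

```python
import csv
import io

def write_submission(item_ids, predictions, handle):
    """Write item_id,deal_probability rows with a header, clipping to [0, 1]."""
    if len(item_ids) != len(predictions):
        raise ValueError("need exactly one prediction per item_id")
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["item_id", "deal_probability"])
    for item_id, p in zip(item_ids, predictions):
        clipped = min(1.0, max(0.0, float(p)))
        writer.writerow([item_id, f"{clipped:.6f}"])

# Made-up ids and raw model outputs; a real file needs all 508438 test item_ids.
buffer = io.StringIO()
write_submission(["a1", "b2", "c3"], [-0.05, 0.3, 1.2], buffer)
submission_text = buffer.getvalue()
```

To write to disk, pass a file opened with `open("submission.csv", "w", newline="")` instead of the in-memory buffer.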
Report the score you get from Kaggle as part of your project presentation.
https://www.kaggle.com/c/avito-demand-prediction/data
There are several files for download - and each of them is very big (look at the file size before trying to download). The following is copied from the Data description from Kaggle:
Same schema as the train data, minus deal_probability. 508438 item_id’s
Same schema as the train data minus deal_probability, image, and image_top_1.
Same schema as the train data minus deal_probability, image, and image_top_1.
Same schema as periods_train.csv, except that the item ids map to an ad in test_active.csv.
These are the exact same images as you’ll find in train_jpg.zip but split into smaller zip archives so the data are easier to download. If you already have train_jpg.zip you do NOT need to download these. We have not made these zips available in kernels as they would only increase the kernel creation time.
If you want to download parts of the data you may either just download from the web page (Data) or install the official Kaggle API https://github.com/Kaggle/kaggle-api. The latter requires that you have Python 3 and pip installed, and that you use your Kaggle account. There is no API available for R.
Useful commands:
If you don’t get this to work: have you confirmed that you have accepted the rules?
You should not download the image part of the data, as those files are large (see file sizes above) and are already available at epic. See Erlend's presentation from 29.02.2019: https://github.com/Froskekongen/MA8701, where you see that the data are available on the epic cluster on /lustre1/projects/fs_ma8701_1/avito
(only for those enrolled in the course who also have accepted the avito challenge rules).
https://www.kaggle.com/c/avito-demand-prediction/kernels
Most kernels are in Python, but a few use R.
Good kernels:
We found this very useful (minus the xgboost part) both for exploratory data analysis and for preprocessing of the train.csv data: https://www.kaggle.com/ganeshn88/detailed-eda-nlp-xgboost-model
In the presentation we want you to show results from applying regularized regression to the data.
The important part here is the regularized regression, and you do not need to use the full data set here - and you may then be able to work on your laptop (and not yet use epic).
Some ideas on what to think about:
deal_probability or a transformation thereof?
Text analysis is a whole subject in itself, and for R users the following teaching corner is a good read: https://kenbenoit.net/pdfs/text_analysis_in_R.pdf. The following can be read from that article on the preprocessing of text:
Bag-of-words approaches use the document-term matrix (DTM) or its TF-IDF version in analyses.
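To make the two representations concrete, here is a small sketch in plain Python on three made-up documents. The document-term matrix (DTM) has one row per document and one column per vocabulary term, holding term counts; TF-IDF reweights the counts so that terms occurring in many documents count less. The idf used here, log(N / df), is one textbook variant and is an assumption of this sketch; libraries typically apply smoothed variants.

```python
import math
from collections import Counter

docs = [
    "red bike for sale",
    "blue bike",
    "red car red seats",
]

# Tokenise on whitespace; real preprocessing would also stem and drop stopwords.
tokens = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokens for word in doc})

# Document-term matrix: one row per document, one column per vocabulary term.
counts = [Counter(doc) for doc in tokens]
dtm = [[c[word] for word in vocab] for c in counts]

# Document frequency: number of documents containing each term.
df = [sum(1 for c in counts if c[word] > 0) for word in vocab]

# Textbook idf = log(N / df); libraries usually apply smoothed variants.
n_docs = len(docs)
idf = [math.log(n_docs / d) for d in df]

# TF-IDF reweights each count by how rare the term is across documents.
tfidf = [[tf * w for tf, w in zip(row, idf)] for row in dtm]
```

With real data the matrix is very wide and mostly zeros, which is why the examples that follow use sparse matrices.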
Here is an example in python (by Ben).
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import pandas as pd, numpy as np
import re, string, random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LassoCV
from sklearn.utils import parallel_backend
from sklearn.model_selection import train_test_split
# T is the length of data you want to run right now (the full dataset takes forever)
# mdf is for setting the min_df value in the TfidfVectorizer function (google is good) -- when building the vocabulary for bag of words ignore terms that have a document frequency strictly lower than min_df
T = 30000
mdf = 50
# get data from csv files
data = pd.read_csv('train.csv', usecols=['description', 'deal_probability'])
desc = (data['description'])
Y = (data['deal_probability'])
data = 0
# break up data into train and test data
traindesc, testdesc, trainY, testY = train_test_split(desc, Y, test_size=0.25, random_state=23)
# shrink training data to T
traindesc = traindesc[:T]
trainY = trainY[:T]
# Replace nans with spaces
traindesc = traindesc.fillna(" ")
testdesc = testdesc.fillna(" ")
## Get "bag of words" transformation of the data -- see example in Lasso book discussed in class
## also: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
vec = TfidfVectorizer(ngram_range=(1,1), min_df=mdf, max_df=0.9, lowercase=True, strip_accents='unicode', sublinear_tf=True)
trainX = vec.fit_transform(traindesc)
testX = vec.transform(testdesc)
# fit lasso model
with parallel_backend('threading'):
    m = LassoCV(cv=5, verbose=True).fit(trainX, trainY)
# show results for fit data
plt.figure()
ax = plt.subplot(111)
plt.plot(m.alphas_, m.mse_path_, ':')
plt.plot(m.alphas_, m.mse_path_.mean(axis=-1), 'k', label='Average across the folds', linewidth=2)
plt.axvline(m.alpha_, linestyle='--', color='k', label='CV estimate')
ax.set_xscale('log')
plt.legend()
plt.xlabel(r'$\lambda$')
plt.ylabel('MSE')
plt.axis('tight')
plt.savefig('lasso_path.png')
# show the terrible predictions
testYpred = m.predict(testX)
plt.figure()
plt.plot(testY, testYpred, '.', alpha=0.1)
plt.title('RMSE: %f'%np.sqrt(np.mean( (testYpred - testY)**2 )))
plt.savefig('lasso_prediction.png')
Here is a long example in R (by Mette - report errors to Mette.Langaas@ntnu.no).
library(tidyverse)
library(lubridate)
library(magrittr)
library(text2vec)
library(tokenizers)
library(stopwords)
library(Matrix)
library(stringr)
library(stringi)
library(forcats)
library(glmnet)
set.seed(0)
# assuming that train.csv is downloaded and is in ./input
Sys.setlocale(locale="ru_RU") # show russian words
#---------------------------
# will not use test set, loading only training data
cat("Reading data...\n")
tr <- read_csv("./input/train.csv")
#---------------------------
cat("Preprocessing...\n")
# here you may add other stuff than this (the commented-out ones are from the recommended kernel)
trpre <- tr %>% mutate(no_img = is.na(image) %>% as.integer(),
no_dsc = is.na(description) %>% as.integer(),
# no_p1 = is.na(param_1) %>% as.integer(),
# no_p2 = is.na(param_2) %>% as.integer(),
# no_p3 = is.na(param_3) %>% as.integer(),
# titl_len = str_length(title),
# desc_len = str_length(description),
# titl_capE = str_count(title, "[A-Z]"),
# titl_capR = str_count(title, "[А-Я]"),
# desc_capE = str_count(description, "[A-Z]"),
# desc_capR = str_count(description, "[А-Я]"),
# titl_cap = str_count(title, "[A-ZА-Я]"),
# desc_cap = str_count(description, "[A-ZА-Я]"),
# titl_pun = str_count(title, "[[:punct:]]"),
# desc_pun = str_count(description, "[[:punct:]]"),
# titl_dig = str_count(title, "[[:digit:]]"),
# desc_dig = str_count(description, "[[:digit:]]"),
user_type = factor(user_type),
category_name = factor(category_name) %>% as.integer(),
parent_category_name = factor(parent_category_name) %>% as.integer(),
region = factor(region) %>% as.integer(),
# param_1 = factor(param_1) %>% as.integer(),
# param_2 = factor(param_2) %>% as.integer(),
# param_3 = factor(param_3) %>% fct_lump(prop = 0.00005) %>% as.integer(),
city = factor(city) %>% fct_lump(prop = 0.0003) %>% as.integer(), #lumping together uncommon factors
user_id = factor(user_id) %>% fct_lump(prop = 0.000025) %>% as.integer(),#lumping together userids not so common
price = log1p(price), # log(price+1)
txt = paste(title, description, sep = " "), # treating title and description together
mday = mday(activation_date), #day of the month
wday = wday(activation_date)) %>% # day of the week
select(user_id,region, city, parent_category_name,user_type,no_img,no_dsc,txt,mday,wday,deal_probability)
# replace_na(list(image_top_1 = -1, price = -1,
# param_1 = -1, param_2 = -1, param_3 = -1,
# desc_len = 0, desc_cap = 0, desc_pun = 0,
# desc_dig = 0, desc_capE = 0, desc_capR = 0)) %T>%
glimpse(trpre)
rm(tr)
gc()
#---------------------------
# how to represent the txt using bag of words (from Part 1)
cat("Parsing text...\n")
it <- trpre %$%
str_to_lower(txt) %>%
str_replace_all("[^[:alpha:]]", " ") %>%
str_replace_all("\\s+", " ") %>%
tokenize_word_stems(language = "russian") %>%
itoken()
str(it)
# then it is a
vect <- create_vocabulary(it, ngram = c(1, 1), stopwords = stopwords("ru")) %>%
prune_vocabulary(term_count_min = 3, doc_proportion_max = 0.4, vocab_term_max = 12500) %>%
vocab_vectorizer()
str(vect)
m_tfidf <- TfIdf$new(norm = "l2", sublinear_tf = T)
tfidf <- create_dtm(it, vect) %>%
fit_transform(m_tfidf)
str(tfidf)
# tf=term frequency
#Creates TfIdf(Latent semantic analysis) model.
#The IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears + 1))
# Tfidf =tf*idf
rm(it, vect, m_tfidf); gc()
#---------------------------
cat("Preparing data...\n")
# design matrix for the non-text part of the data (the tfidf part is kept separately in tfidf)
Xrest <- trpre %>%
select(-txt,-deal_probability) %>%
sparse.model.matrix(~ . - 1, .)
idtest=1127569:1503424
idrest=1:1127568 # must not overlap with idtest
# go for 1e5 training samples and 1e5 validation samples - and 1e5 fake test set
# the true test is kept in vault now, and not looked at for a while!
set.seed(8701)
randtrain=sample(idrest,1e5)
randvalid=sample(setdiff(idrest,randtrain),1e5)
randtest=sample(setdiff(idrest,union(randtrain,randvalid)),1e5)
# alternatively the same user should not be split?
Xtr=Xrest[randtrain,]
Ytr=trpre[randtrain,]$deal_probability
Xval=Xrest[randvalid,]
Yval=trpre[randvalid,]$deal_probability
Xtest=Xrest[randtest,]
Ytest=trpre[randtest,]$deal_probability
tfidftr=tfidf[randtrain,]
tfidfval=tfidf[randvalid,]
tfidftest=tfidf[randtest,]
#---------------------------
cat("Training model...\n")
fit=glmnet(x=Xtr,y=Ytr) #standardize=TRUE is the default; no intercept column in Xtr (~ . - 1) since glmnet already includes an intercept
# since we have all these data I want to use the validation set to choose the
# optimal lambda, not the cv.glmnet -therefore just loop over the lambdas
lambdas=fit$lambda
rmse=rep(NA,length.out=length(lambdas))
for (i in 1:length(lambdas))
{
print(i)
thislambda=lambdas[i]
yhats=predict(fit,newx=Xval,type="response",s=thislambda)
rmse[i]=sqrt(mean((Yval-yhats)^2))
}
plot(lambdas,rmse)
# OLS is the best with these predictors - so, this was a test just to check that
# things are working before going on to the tfidf
fit=glmnet(x=tfidftr,y=Ytr,standardize = FALSE) #since weighted?
lambdas=fit$lambda
rmse=rep(NA,length.out=length(lambdas))
for (i in 1:length(lambdas))
{
print(i)
thislambda=lambdas[i]
yhats=predict(fit,newx=tfidfval,type="response",s=thislambda)
rmse[i]=sqrt(mean((Yval-yhats)^2))
}
plot(lambdas,rmse)
bestlambda=lambdas[which.min(rmse)]
yhattest=predict(fit,newx=tfidftest,s=bestlambda)
testrmse=sqrt(mean((Ytest-yhattest)^2))
testrmse # not really winning anything here, but, a good start :-)
plot(Ytest,yhattest,pch=20) #ups...
# lots to check out next, transforming the Y? and adding other covs?
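The idea in the R example of choosing lambda on a separate validation set instead of by cross-validation can be sketched in Python as follows. Note the swap: this sketch uses ridge regression in closed form rather than the lasso, only because that keeps it self-contained with NumPy. The data are synthetic, and the names `ridge_fit` and `rmse` as well as the lambda grid are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(8701)

# Synthetic data (made up): 20 predictors, only the first 3 matter.
n_train, n_valid, p = 200, 200, 20
beta = np.zeros(p)
beta[:3] = [1.0, -0.5, 0.25]
X_train = rng.normal(size=(n_train, p))
X_valid = rng.normal(size=(n_valid, p))
y_train = X_train @ beta + rng.normal(scale=0.5, size=n_train)
y_valid = X_valid @ beta + rng.normal(scale=0.5, size=n_valid)

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients (no intercept): (X'X + lam I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def rmse(y, yhat):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y - yhat) ** 2)))

# Fit once per lambda on the training set, score on the validation set.
lambdas = np.logspace(-3, 3, 25)
valid_rmse = np.array([rmse(y_valid, X_valid @ ridge_fit(X_train, y_train, lam))
                       for lam in lambdas])
best_lambda = float(lambdas[np.argmin(valid_rmse)])
```

The final model with `best_lambda` would then be evaluated once on the test set that has been kept in the vault.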
More on glmnet:
In the presentation we want you to show results from applying deep learning to the data. This need not involve images, but could.
This will happen in Part 4: Deep learning.
What do you want to try out in addition to the regularized regression and deep learning? You are not required to do more methods, but maybe you have been inspired by some of the kernels?
The presentation will last for 10-20 minutes (depending on the group size), and all group members must present. The presentation should focus on the methods, and could include information on: