Wednesday, 30 January 2019

Normalize test data (using IDF)

I would like to know how to prepare unseen (test) data so that it can be passed through a model I have already trained and that works well.

The unseen test data has been preprocessed with the same steps as the data for the classification model, except for the IDF step.

I understand that I have to normalize the test data; otherwise I get an error when I try to pass it through the trained model.
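
The error text is not quoted here, so the following is only a guess at the cause: a minimal sketch, assuming the problem is a mismatch in the number of features. The toy corpora and the names unseen_text, x_unseen and x_wrong are hypothetical. It compares reusing a fitted TfidfVectorizer with fitting a fresh one on the unseen data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for train_text and the unseen data
train_text = ["good movie", "bad movie", "great plot"]
unseen_text = ["terrible acting", "good acting overall"]

# Fit once, on the training text only: this learns the vocabulary and IDF weights
vectorizer = TfidfVectorizer(norm='l2')
x_train = vectorizer.fit_transform(train_text)

# Reusing the fitted vectorizer keeps the same columns as x_train;
# words not seen during fitting are simply ignored
x_unseen = vectorizer.transform(unseen_text)

# A freshly fitted vectorizer learns a different vocabulary,
# so its output has a different number of columns
x_wrong = TfidfVectorizer(norm='l2').fit_transform(unseen_text)

print(x_train.shape[1], x_unseen.shape[1], x_wrong.shape[1])
```

If the column counts differ, a classifier trained on x_train cannot score the new matrix, which would be consistent with an error at predict time.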

The data for the model was vectorized and normalized as follows (after the train/test split):

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(strip_accents='unicode', analyzer='word', ngram_range=(1, 3), norm='l2')
# Fit on the training text only; a second fit on test_text would
# overwrite the vocabulary and IDF weights learned from train_text
vectorizer.fit(train_text)

x_train = vectorizer.transform(train_text)
y_train = train.drop(labels=['id', 'comment_text'], axis=1)

# Reuse the fitted vectorizer so the test columns match the training columns
x_test = vectorizer.transform(test_text)
y_test = test.drop(labels=['id', 'comment_text'], axis=1)

And here is the model:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier


%%time 

# Using pipeline for applying logistic regression and one vs rest classifier
LogReg_pipeline = Pipeline([
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
])

for category in categories:
    # printmd is a helper defined elsewhere in the notebook (not shown here)
    printmd('**Processing {} comments...**'.format(category))

    # Training logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[category])

    # Calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(test[category], prediction)))
    print("\n")

I would like to know how to prepare the test data so that it can be passed through the model.

When I pass the data through the model as it is:

prediction = LogReg_pipeline.predict(x_test) 

I get an error message.
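
One possible way to avoid preparing the unseen data by hand is to put the vectorizer inside the pipeline, so that predict accepts raw text. This is a sketch under assumptions, not a confirmed fix for the error above: the toy data and the names text_pipeline, train_labels and unseen_text are hypothetical, and max_iter is raised only to help the 'sag' solver converge on such a tiny sample.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for one category of the real dataset
train_text = ["you are awful", "have a nice day", "awful awful person", "nice work, thanks"]
train_labels = [1, 0, 1, 0]
unseen_text = ["what an awful day", "thanks, nice one"]

# The vectorizer is the first pipeline step, so it is fitted together
# with the classifier and reused unchanged at prediction time
text_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(strip_accents='unicode', analyzer='word',
                              ngram_range=(1, 3), norm='l2')),
    ('clf', OneVsRestClassifier(LogisticRegression(solver='sag', max_iter=1000))),
])

# fit() learns the vocabulary, the IDF weights and the classifier from training data only
text_pipeline.fit(train_text, train_labels)

# predict() only transforms the raw unseen text with the fitted vectorizer, then classifies
prediction = text_pipeline.predict(unseen_text)
print(prediction)
```

With this layout the same fitted object can be saved and applied to any later batch of raw comments, with no separate fit on the new data.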

Thank you

Josep Maria
