Monday, October 26, 2020

Checking model overfit of doc2vec with infer_vector()

My aim is to create document embeddings from the column df["text"] as a first step, and then plug them, along with other variables, into an XGBoost regressor to make predictions. This works very well for train_df.
I am currently trying to evaluate my trained Doc2Vec model by inferring vectors with infer_vector() on the unseen test_df and then making predictions with them. However, the results are very bad: I get a very large error (RMSE). Does this mean that Doc2Vec is massively overfitting? I am also not sure whether inferring vectors with infer_vector() is even the correct way to evaluate a Doc2Vec model. What can I do to prevent Doc2Vec from overfitting?

Please find below my code for inferring vectors from the model:

vectors_test = []
for text in test_df["text"]:
    # infer_vector() expects a tokenized list of words, not a raw string
    vectors_test.append(model.infer_vector(tokenize(text)))
# reuse test_df's index so the concat below aligns rows correctly
vectors_test = pd.DataFrame(vectors_test, index=test_df.index)
test_df = pd.concat([test_df, vectors_test], axis=1)

I then make predictions with my XGBoost model:

np.random.seed(0)
test_df = test_df.reindex(np.random.permutation(test_df.index))

y = test_df['target'].values
# drop the raw text column as well, so only numeric features reach XGBoost
X = test_df.drop(['target', 'text'], axis=1).values

y_pred = mod.predict(X)
pred = pd.DataFrame()
pred["Prediction"] = y_pred
rmse = np.sqrt(mean_squared_error(y, y_pred))
print(rmse)
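To judge whether a "very large" RMSE really signals overfitting, it helps to compare it against a trivial baseline that always predicts the training mean; if the model barely beats that, the embeddings are not transferring to the test set. A minimal numpy sketch with placeholder targets (y_train and y_test below are synthetic, not the post's data):

```python
import numpy as np

rng = np.random.default_rng(0)
y_train = rng.normal(10.0, 2.0, size=200)   # placeholder training targets
y_test = rng.normal(10.0, 2.0, size=50)     # placeholder test targets

# Naive baseline: always predict the training-set mean
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_rmse = np.sqrt(np.mean((y_test - baseline_pred) ** 2))
print(round(baseline_rmse, 3))
```

The model's test RMSE should sit clearly below this baseline; the gap between train RMSE and test RMSE is then the overfitting signal.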

Please also see the training of my Doc2Vec model:

doc_tag = train_df.apply(lambda row: TaggedDocument(words=tokenize(row["text"]), tags=[row.Tag]), axis=1)

# initializing the model and building the vocabulary

model = Doc2Vec(dm=0, vector_size=200, min_count=1, window=10, workers=cores)

model.build_vocab([x for x in tqdm(doc_tag.values)])

# train the model for 5 epochs; calling train() once with epochs=5
# lets gensim manage the learning-rate decay across epochs itself

model.train([x for x in tqdm(doc_tag.values)], total_examples=len(doc_tag), epochs=5)
