Wednesday, April 24, 2019

How to split a large dataset into train/test sets while also using pandas chunked iteration for updating

I am updating my model parameters on every iteration with a batch read from a very large file. Before I do this, I want to split the entire dataset into a train set and a test set. I also want to do the same kind of split for cross-validation.

I have tried using dask to split the entire set and then convert a partition to pandas so I can use batches for updating my algorithm.

The dask part (which I would rather not use, if possible):

import dask.dataframe as dd

df_bag = dd.read_csv("gdrive/My Drive/train_triplets.txt", blocksize=int(1e9), sep='\s+', header=None)  # read in ~1 GB blocks
df_train, df_test = df_bag.random_split([2/3, 1/3], random_state=0)  # split into train and test
df_batch = df_train.loc[1:1000].compute()  # materialize the first 1000 rows as a pandas batch
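
For reference, a minimal sketch of how the dask train split could be consumed in pandas-sized batches (assuming the same file path and split as above): each partition of the dask dataframe can be materialized as one pandas batch.

import dask.dataframe as dd

df_bag = dd.read_csv("gdrive/My Drive/train_triplets.txt", blocksize=int(1e9), sep='\s+', header=None)
df_train, df_test = df_bag.random_split([2/3, 1/3], random_state=0)

for i in range(df_train.npartitions):
    batch = df_train.get_partition(i).compute()  # one partition as a pandas DataFrame
    # update the algorithm with `batch` here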

The pandas part:

import pandas as pd

df_chunk = pd.read_csv("gdrive/My Drive/train_triplets.txt", chunksize=6000000, sep='\s+', header=None)  # read lazily in 6,000,000-row chunks
for chunk in df_chunk:
    # here I have my algorithm
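
One pandas-only possibility is sketched below (the 2/3 ratio and the per-chunk seeding are assumptions, not part of the original code): draw a reproducible random mask for every chunk, so each row always lands in the same split on every pass over the file, and update the parameters only on the train rows.

import numpy as np
import pandas as pd

reader = pd.read_csv("gdrive/My Drive/train_triplets.txt", chunksize=6000000, sep='\s+', header=None)
for i, chunk in enumerate(reader):
    rng = np.random.RandomState(i)          # fixed per-chunk seed -> the same split on every pass
    mask = rng.rand(len(chunk)) < 2 / 3     # ~2/3 of the rows go to train, ~1/3 to test
    train_batch = chunk[mask]               # use this batch to update the parameters
    test_batch = chunk[~mask]               # keep this batch aside for evaluation
    # update the algorithm with train_batch here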

I expect that it is possible to have a pandas function that reads the file in chunks from a path, as I already do, but that also splits it into a train and a test set, so that I can iterate in batches over the large train and test sets individually, and so that I can further split the train set to perform cross-validation.
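
For the cross-validation part, a possible sketch (the 5 folds and the position-based fold assignment are assumptions, not from the question) is to assign every row to a fold by its position in the file and stream through the chunks once per fold, updating the model on the other folds and testing on the held-out one.

import numpy as np
import pandas as pd

N_FOLDS = 5
PATH = "gdrive/My Drive/train_triplets.txt"

for fold in range(N_FOLDS):
    reader = pd.read_csv(PATH, chunksize=6000000, sep='\s+', header=None)
    offset = 0
    for chunk in reader:
        rows = np.arange(offset, offset + len(chunk))   # global row positions in the file
        offset += len(chunk)
        is_test = (rows % N_FOLDS) == fold              # every N_FOLDS-th row belongs to this fold
        train_batch = chunk[~is_test]                   # update the model with these rows
        test_batch = chunk[is_test]                     # evaluate the model on these rows
        # incremental update / evaluation for this fold goes here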
