I am updating my model parameters on every iteration with a batch read from a very large file. Before I do this, I want to split the entire dataset into a train and a test set, and I want to do the same kind of split for cross-validation.
I have tried to use dask to split the entire set and then convert a partition to pandas so that I can use batches to update my algorithm.
The dask part (which I would rather not use, if possible):

import dask.dataframe as dd

df_bag = dd.read_csv("gdrive/My Drive/train_triplets.txt",
                     blocksize=int(1e9), sep=r'\s+', header=None)
df_train, df_test = df_bag.random_split([2/3, 1/3], random_state=0)
df_batch = df_train.loc[1:1000].compute()
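For completeness, this is roughly how I imagine pulling batches out of the dask train split partition by partition instead of slicing with .loc (just a sketch using to_delayed(), which returns each partition as a delayed pandas DataFrame):

# sketch: iterate over the train split one partition at a time as pandas frames
for delayed_part in df_train.to_delayed():
    df_batch = delayed_part.compute()
    # ... update my parameters with df_batch ...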
The pandas part:

import pandas as pd

df_chunks = pd.read_csv("gdrive/My Drive/train_triplets.txt",
                        chunksize=6000000, sep=r'\s+', header=None)
for chunk in df_chunks:
    # here I have my algorithm (parameter update on the current batch)
I expect that it is possible to have a pandas function that reads the file in chunks from a path/URL, as I already do, but with the data split into a train and a test set, so that I can iterate in batches over the large train and test sets individually, and so that I can also split the train set in order to perform cross-validation.
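To make it concrete, something like the following sketch is what I am after; the split_chunk helper, the hash constant, the fold rule and the fractions are hypothetical placeholders, not working code I already have:

import pandas as pd

PATH = "gdrive/My Drive/train_triplets.txt"
TEST_FRACTION = 1 / 3   # hypothetical: same 2/3 vs 1/3 split as above
N_FOLDS = 5             # hypothetical number of cross-validation folds
SEED = 0

def split_chunk(chunk, seed=SEED):
    # Assign each row to test/train (and a CV fold) from its global row index
    # only, so every pass over the file produces exactly the same split.
    idx = chunk.index.to_numpy()
    u = ((idx + seed) * 2654435761 % 2**32) / 2**32   # cheap deterministic hash -> [0, 1)
    is_test = u < TEST_FRACTION
    test = chunk[is_test]
    train = chunk[~is_test]
    folds = idx[~is_test] % N_FOLDS                   # hypothetical round-robin fold rule
    return train, test, folds

for chunk in pd.read_csv(PATH, chunksize=6000000, sep=r'\s+', header=None):
    train_batch, test_batch, folds = split_chunk(chunk)
    # ... update parameters with train_batch, evaluate on test_batch,
    #     and hold out rows with folds == k from train_batch for cross-validation ...

The point of deriving the split from the global row index is that it stays stable across repeated passes over the file, which is what I need for both the held-out test set and the cross-validation folds.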