I've been following Sentdex's video on NLTK and Python, and have put together a script that determines review sentiment using various models, e.g. logistic regression. My worry is that Sentdex's approach includes the test set when deciding which words to use as training features, which leaks information: the train/test split should happen before feature selection, not after.
I've tried to keep the test set out (see the end of this post), but without success. How can I exclude the test set from the feature-selection process? Code:
# READ IN FILES AS DF (same format as nltk.movie_reviews)
import nltk
import numpy as np
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

reader = CategorizedPlaintextCorpusReader('C:/.../Data/amazon_reviews_processed',
                                          r'.*\.txt', cat_pattern=r'(\w+)/*',
                                          encoding='latin1')
documents = [(list(reader.words(fileid)), category)
             for category in reader.categories()
             for fileid in reader.fileids(category)]

# CREATE FEATURE-LIST
all_words = []
for w in reader.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(document):
    words = set(document)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

# RANDOMIZE LIST AND DETERMINE TRAIN/TEST
np.random.shuffle(featuresets)
training_set = featuresets[:8000]
testing_set = featuresets[8000:]
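For reference, the ordering I'm aiming for is: shuffle and split the raw documents first, then build the vocabulary from the training half only, then featurize both halves with that vocabulary. A minimal self-contained sketch with toy documents (the four hard-coded reviews and the halfway split are placeholders for the real corpus and the 8000 cutoff; `most_common` replaces the `FreqDist.keys()` slice so the vocabulary is actually frequency-ordered):

```python
import random
from collections import Counter

# Toy stand-in for `documents`: (word-list, label) pairs.
documents = [(["good", "great", "fun"], "pos"),
             (["bad", "awful", "boring"], "neg"),
             (["great", "enjoyable"], "pos"),
             (["boring", "dull"], "neg")]

# 1. Shuffle and split BEFORE any feature selection.
random.shuffle(documents)
split = len(documents) // 2          # stand-in for the 8000 cutoff
train_docs = documents[:split]
test_docs = documents[split:]

# 2. Build the vocabulary from the training documents only.
freq = Counter(w.lower() for words, _ in train_docs for w in words)
word_features = [w for w, _ in freq.most_common(3000)]

# 3. Featurize both splits with the training-derived vocabulary.
def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

training_set = [(find_features(words), label) for words, label in train_docs]
testing_set = [(find_features(words), label) for words, label in test_docs]
```

The test documents never touch `freq`, so unseen test-only words are simply absent from the feature dicts, which is the behaviour a held-out set should have.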
What I've already tried (it yields the error "AttributeError: 'list' object has no attribute 'items'"):
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

reader = CategorizedPlaintextCorpusReader('C:/Users/Laurie Bamber/Documents/Network Tool/Functions/Sentiment/Data/amazon_reviews_processed',
                                          r'.*\.txt', cat_pattern=r'(\w+)/*',
                                          encoding='latin1')
documents = [(list(reader.words(fileid)), category)
             for category in reader.categories()
             for fileid in reader.fileids(category)]

np.random.shuffle(documents)
training_set = documents[:8000]
testing_set = documents[8000:]

all_words = []
for w in reader.words():
    all_words.append(w.lower())
all_words = nltk.FreqDist(all_words)
word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features

featuresets = [(find_features(rev), category) for (rev, category) in training_set]
np.random.shuffle(featuresets)
training_set = featuresets
testing_set = testing_set

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) * 100)
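If it helps diagnose the error: in this attempt `testing_set` is still the raw slice of `documents`, i.e. (word-list, label) pairs, and only the training half is ever passed through `find_features`. The classifier's accuracy check expects feature dicts and calls `.items()` on each one, so it fails on the lists. A self-contained demonstration of the missing step (the two-word vocabulary and toy reviews are my own stand-ins; `find_features` mirrors the function above):

```python
# Stand-in vocabulary and featurizer mirroring the question's find_features.
word_features = ["good", "bad"]

def find_features(document):
    words = set(document)
    return {w: (w in words) for w in word_features}

# Before the fix: raw (word-list, label) pairs -- a list has no .items(),
# which is exactly what the AttributeError complains about.
testing_set = [(["good", "movie"], "pos"), (["bad", "film"], "neg")]

# The missing step: featurize the test split too, with the same
# training-derived word_features.
testing_set = [(find_features(rev), category) for (rev, category) in testing_set]
```

After this mapping each test item is a ({feature: bool}, label) pair, the same shape as `training_set`, so the accuracy check can iterate over the feature dicts.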
I'm a Python amateur and quite confused by this problem (for example, I don't know whether 'for w in reader.words():' also needs to be changed so it only iterates over the training files). Any help would be appreciated.