Modeling.ipynb
%% Cell type:code id:39dde399-29f5-4c8e-ae7f-99217ddfeed0 tags:
``` python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_predict
from sklearn.utils import shuffle
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
```
%% Cell type:markdown id:680c531b-7a71-4704-8d1d-6105eca178d3 tags:
The modeling portion of the project went very well for us. After researching good models for text classification, we decided to delve into Naive Bayes, Stochastic Gradient Descent, and a Support Vector Classifier. On the initial run-through of the models, we came out with accuracies of:

- NB: 0.70
- SGD: 0.74
- SVC: 0.757

Naive Bayes was the worst of the three, especially once we looked at its confusion matrix: it hardly predicted any score other than 5 correctly, whereas SGD and SVC saw more admirable results.

The hardest parts of this project were fine-tuning parameters and dealing with crashes. Even though we sampled the initial >5,000,000 reviews, the CountVectorizer output was still such a large object that classification testing was not much faster. With the SVC especially, a high-degree polynomial or sigmoid kernel could take up to 20 minutes to finish classification, and SGD also got progressively slower as the learning rate got larger.

Crashing and restarting the kernel was not so much a challenge as a setback: every day when we began working again, and on the occasional crash, the initial loading of the dataset could take anywhere from 20-30 minutes, which was a pain.

The other main challenge, which we hardly consider a challenge because we enjoyed it, was trying to find new models to use. What pre-built models are already proficient at the task we are trying to accomplish? How do we use them? What do they do differently? How can we try to improve on them? These were all questions we had to answer throughout the modeling stage. A compact side-by-side sketch of that comparison opens the Classification section below.

For the final steps, we will take a more in-depth look at our highest-accuracy classifier. This turned out to be SGD, which improved to 0.763 after tuning.
%% Cell type:markdown id:fc6f03f6-65b4-4bb0-bc33-7d740cd194d4 tags:
### Load in the dataset and drop the unnecessary columns
%% Cell type:code id:64477c21-ed58-4d13-806e-2c04d5cf375e tags:
``` python
df = pd.read_json(r'../data/Grocery_and_Gourmet_Food.json',lines=True)
```
%% Cell type:code id:3edb4829-8785-491a-91c8-063549ed0e3b tags:
``` python
df = df.drop(columns=['reviewTime','reviewerID','asin','reviewerName','unixReviewTime','vote','image','style'])
```
%% Cell type:markdown id:d0b4c630-e4b9-42ce-981d-99aa283a04db tags:
### The dataset contains over 5 million rows, far too many to compute on directly, so we sample it down to a size we can use
%% Cell type:code id:4304a47d-5e2f-426c-9b80-bde79660127e tags:
``` python
# Note: replace=True samples with replacement, so a few reviews can appear
# more than once; frac=0.015 keeps roughly 76k of the ~5M rows.
df_test = df.sample(frac=0.015, replace=True, random_state=10000)
```
%% Cell type:markdown id:971d25f3-6061-4aea-b1d6-7ab695e8d932 tags:
### Removing the very basic words (and, the, or, etc.) and some common misspellings
%% Cell type:code id:afc9be75-6e42-4b31-b7d7-a2b4fecd6a8d tags:
``` python
# Hand-picked stop words, plus fragments and misspellings we saw in the data.
basic_words = ['the',' the ','and','it','to','of','this','in', 'is','was','as','wh','tte', 'on', 'th',' or ', ' at']
```
%% Cell type:code id:cd57445a-382e-456c-8fbe-bd738c71c48c tags:
``` python
# Strip each basic word from the review text (simple substring replacement).
for word in basic_words:
    df_test.reviewText = df_test.reviewText.astype('str').apply(lambda x: x.replace(word, ' '))
```
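%% Cell type:markdown id:5b1f4c32-3e4d-4071-bc8f-9d0e1f2a3b4c tags:
One caveat: the substring replace above also hits words that merely contain a basic word (the 'the' inside 'other', for instance). The cell below is a sketch of a word-boundary alternative that only removes whole tokens; it assumes `df_test` and `basic_words` from above. CountVectorizer's built-in `stop_words='english'` is another option.
%% Cell type:code id:6c205d43-4f5e-4182-8d90-0e1f2a3b4c5d tags:
``` python
# Sketch: whole-word removal with a regex word boundary, instead of raw
# substring replacement. Assumes df_test and basic_words defined above.
import re

pattern = re.compile(r'\b(?:' + '|'.join(w.strip() for w in basic_words) + r')\b')
df_test.reviewText = df_test.reviewText.astype('str').apply(lambda x: pattern.sub(' ', x))
```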
%% Cell type:code id:f2d06d23-a238-4f12-b4a2-ab4d60f440a8 tags:
``` python
#https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
cv = CountVectorizer(min_df=0, lowercase=True)
# fit_transform returns the sparse document-term count matrix.
vectorizer = cv.fit_transform(df_test['reviewText'].values.astype('U'))
```
%% Cell type:code id:2ba7fc89-83ea-457c-af3a-2de2414fd065 tags:
``` python
# .shape works on the sparse matrix directly; calling .toarray() first
# would materialize a dense 76112 x 27488 array in memory.
vectorizer.shape
```
%%%% Output: execute_result
(76112, 27488)
%% Cell type:code id:83601929-b72f-41c6-9e34-3ab907eb1c33 tags:
``` python
tfidf_transformer = TfidfTransformer(smooth_idf=True, use_idf=True)
tfidf_transformer.fit(vectorizer)
```
%%%% Output: execute_result
TfidfTransformer()
%% Cell type:code id:51c77a04-0825-4e71-bfa7-b6f30503de69 tags:
``` python
# get_feature_names() was renamed to get_feature_names_out() in newer scikit-learn.
df_idf = pd.DataFrame(tfidf_transformer.idf_, index=cv.get_feature_names(), columns=["idf_weights"])
```
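%% Cell type:markdown id:7d316e54-5a6f-4293-9ea1-1f2a3b4c5d6e tags:
As a quick sanity check on the weights (a small sketch using `df_idf` from above), the lowest-idf tokens should be the most common words that survived the cleanup, and the highest-idf tokens the rarest:
%% Cell type:code id:8e427f65-6b70-43a4-af12-2a3b4c5d6e7f tags:
``` python
# Most common surviving tokens get the lowest idf; rarest get the highest.
print(df_idf.sort_values('idf_weights').head(10))
print(df_idf.sort_values('idf_weights').tail(10))
```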
%% Cell type:markdown id:536a9f2f-263b-41aa-b9b1-770f7a6b8c1a tags:
# Classification
%% Cell type:code id:9ab8a3cc-decd-47f7-89ce-ab06a283ab38 tags:
``` python
X = df_test.reviewText.astype('str')
y = df_test.overall
```
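%% Cell type:markdown id:3f9d2a10-1c2b-4e5f-9a6d-7b8c9d0e1f2a tags:
Before tuning each model individually, here is a minimal sketch of the side-by-side comparison described in the introduction. It reuses the imports from the top of the notebook and mostly default settings; the tuned runs follow in the sections below.
%% Cell type:code id:4a0e3b21-2d3c-4f60-ab7e-8c9d0e1f2a3b tags:
``` python
# Sketch: fit each candidate on the same split and compare test accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
candidates = {'NB': MultinomialNB(),
              'SGD': SGDClassifier(random_state=42),
              'SVC': SVC(kernel='linear')}
for name, clf in candidates.items():
    pipe = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', clf)])
    pipe.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, pipe.predict(X_te)))
```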
%% Cell type:markdown id:a9e81b57-319e-42fe-83ae-5763aba99900 tags:
### MultinomialNB
%% Cell type:code id:e5caa8db-bcee-4ff0-b1c9-a60aed571eb0 tags:
``` python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, shuffle=True)
# https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB(alpha=0.01))])
text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
np.mean(predicted == y_test)
```
%%%% Output: execute_result
0.7306647981080845
%% Cell type:markdown id:630ecce8-768b-42ca-8fc0-1d7b2b761b9f tags:
### alpha:
%% Cell type:markdown id:05546eb8-7668-4c70-833e-cf862bdf904b tags:
##### 1 (default) --> .70
##### .1 --> .730
##### .01 --> .731
##### .001 --> .727
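%% Cell type:markdown id:9f538a76-7c81-44b5-8a23-3b4c5d6e7f80 tags:
These accuracies came from rerunning the cell above with each alpha by hand; the same sweep can be written as a short loop (a sketch, using the same pipeline and split as above):
%% Cell type:code id:0a649b87-8d92-45c6-9b34-4c5d6e7f8091 tags:
``` python
# Sweep the MultinomialNB smoothing parameter alpha.
for alpha in [1.0, 0.1, 0.01, 0.001]:
    clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB(alpha=alpha))])
    clf.fit(X_train, y_train)
    print(alpha, accuracy_score(y_test, clf.predict(X_test)))
```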
%% Cell type:code id:cd364e16-9450-4d0a-a9bb-4d0a9e6166da tags:
``` python
print(classification_report(y_test, predicted))
```
%% Cell type:code id:c8f3f4a8-0cd6-4652-8816-e0a8bb0ed364 tags:
``` python
print(confusion_matrix(y_test, predicted))
```
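%% Cell type:markdown id:1b750c98-9ea3-46d7-8c45-5d6e7f8091a2 tags:
The raw counts are dominated by the 5-star class, so a row-normalized heatmap (a sketch using the seaborn import from the top; `normalize='true'` needs scikit-learn >= 0.22) makes the "almost everything predicted as 5" pattern easier to see:
%% Cell type:code id:2c861da9-afb4-47e8-9d56-6e7f8091a2b3 tags:
``` python
# Row-normalize so each true-rating row sums to 1, then plot with seaborn.
labels = sorted(y.unique())
cm = confusion_matrix(y_test, predicted, normalize='true')
sns.heatmap(cm, annot=True, fmt='.2f', xticklabels=labels, yticklabels=labels)
plt.xlabel('predicted rating')
plt.ylabel('true rating')
plt.show()
```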
%% Cell type:markdown id:bec6de18-128a-440c-8507-82850e2c454d tags:
### Stochastic Gradient Descent
%% Cell type:code id:35c82dd1-93cf-49f7-86cb-8b48b2faa33c tags:
``` python
#https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-5, random_state=42,
                                           max_iter=1000, tol=None, eta0=0.001,
                                           learning_rate='constant'))])
text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
np.mean(predicted == y_test)
```
%%%% Output: execute_result
0.7634229657528248
%% Cell type:markdown id:f5068574-1e1a-4d7b-ad30-24800a2ec0ca tags:
### Alpha Ranges:
##### [1e-8, 1e-10] --> .70
##### 1e-7 --> .717
##### 1e-6 --> .727
##### 1e-5 --> .76
##### 1e-4 --> .749
%% Cell type:markdown id:edda8117-1ea7-470c-b499-7ac80881e29b tags:
### max_iter:
%% Cell type:markdown id:7006ec4f-5878-4c8a-955d-17c4021bc882 tags:
##### max_iter = 1000 (default) --> .762
##### max_iter = 500 --> .762
##### max_iter = 10 --> .761
##### max_iter = 5 --> .76
%% Cell type:markdown id:70bcca82-d177-4b1a-a06d-f86c917359b8 tags:
### learning_rate:
%% Cell type:markdown id:f11a44b4-4980-4f9a-b6db-1e72678808e0 tags:
##### optimal (default) --> .762
##### constant (eta0 = .001) --> .763
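%% Cell type:markdown id:3d972eba-b0c5-48f9-8e67-7f8091a2b3c4 tags:
All of the numbers above came from editing and rerunning the SGD cell; a GridSearchCV sketch over the same knobs (hypothetical grid values, same pipeline) automates the search:
%% Cell type:code id:4ea83fcb-c1d6-490a-9f78-8091a2b3c4d5 tags:
``` python
# Sketch: grid-search the SGD hyperparameters we tuned by hand above.
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([('vect', CountVectorizer()),
                 ('tfidf', TfidfTransformer()),
                 ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                       random_state=42, tol=None))])
param_grid = {'clf__alpha': [1e-6, 1e-5, 1e-4],
              'clf__learning_rate': ['optimal', 'constant'],
              'clf__eta0': [0.001]}
search = GridSearchCV(pipe, param_grid, cv=3, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```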
%% Cell type:code id:701c6887-133e-4ec5-be1b-de6b67f03101 tags:
``` python
print(classification_report(y_test, predicted))
```
%% Cell type:code id:e273735b-79d5-49b7-aba5-671879f942ae tags:
``` python
print(confusion_matrix(y_test, predicted))
```
%% Cell type:markdown id:3b083b03-eba6-4597-af3e-58ce574c42ec tags:
### Support Vector Classifier
%% Cell type:code id:417731f2-5de7-4473-b5c9-e6b5e1e01928 tags:
``` python
#https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='linear'))])
# Train on only 25% of the sample here: SVC fits were the slowest step.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.75, random_state=0, shuffle=True)
text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
np.mean(predicted == y_test)
```
%%%% Output: execute_result
0.7568320369981081
%% Cell type:code id:0ddb56ec-5ff9-4276-9d95-884c323a278d tags:
``` python
print(classification_report(y_test, predicted))
```
%% Cell type:code id:8852da08-c2f9-4e58-ac10-b5ee20f05d79 tags:
``` python
print(confusion_matrix(y_test, predicted))
```
%% Cell type:markdown id:6d206a56-628f-4866-85ef-8aac08a93349 tags:
### kernel:
%% Cell type:markdown id:43fccb6d-0eb0-4d8a-ba21-d1ff9f7c7789 tags:
##### linear --> .757
##### poly (degree 2) --> .735
##### poly (degree 3) --> .712
##### poly (degree 5) --> .71
##### sigmoid --> .744
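%% Cell type:markdown id:5fb940dc-d2e7-4a1b-8089-91a2b3c4d5e6 tags:
The kernel numbers above were collected one run at a time; the equivalent loop (a sketch on the same reduced training split; note that a high-degree polynomial or sigmoid fit could take up to 20 minutes each) is:
%% Cell type:code id:60ca51ed-e3f8-4b2c-9190-a2b3c4d5e6f7 tags:
``` python
# Sketch: compare SVC kernels on the same split. degree is ignored for
# non-polynomial kernels.
for kernel, degree in [('linear', 3), ('poly', 2), ('poly', 3),
                       ('poly', 5), ('sigmoid', 3)]:
    clf = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', SVC(kernel=kernel, degree=degree))])
    clf.fit(X_train, y_train)
    print(kernel, degree, accuracy_score(y_test, clf.predict(X_test)))
```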
%% Cell type:code id:36a0bad6-484b-4a36-a67a-0e049898210a tags:
``` python
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SVC(kernel='poly', degree=3))])
text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_test)
print(np.mean(predicted == y_test))
print(classification_report(y_test, predicted))
print(confusion_matrix(y_test, predicted))
```
%% Cell type:code id:cbd28132-daa1-4a3b-a451-de42abec2979 tags:
``` python
# https://www.svm-tutorial.com/2014/10/svm-linear-kernel-good-text-classification/
# Text data is high-dimensional and sparse, so the classes tend to be close
# to linearly separable already; the poly and sigmoid kernels add cost
# without improving accuracy, which matches our results above.
```