For my Intro to Machine Learning class, I had to devise a problem to run unsupervised learning algorithms to
see how each one fared. From the Machine Learning repository at UCI I chose
Popularity of Online News article dataset.
This problem seems interesting because it has lot of practical uses and is challenging at the same time.
A lot of variables effect which articles are popular and which are not. The content and style of the article along with the
sociocultural and political atmosphere at the time also affect it's popularity.
Popularity itself can also be defined in various ways for example, how long the articles remain relevant,
how many people read the articles, how often it is shared in social media or how many comments does it
generate compared to other articles. For this dataset, the data collecters had chosen popularity to mean how many shares
it generated in social media. It is a good quantitative metric which was collected for over two years.
The following are some of the attributes of the dataset:
url, timedelta, n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, num_imgs, num_videos, average_token_length, num_keywords, data_channel_is_lifestyle
Two examples of the dataset:
http://mashable.com/2013/01/07/astronaut-notre-dame-bcs/, 731, 9, 531, 0.503787878, 0.999999997, 0.665634673, 9, 0, 1, 0, 4.404896422
http://mashable.com/2013/01/07/beewi-smart-toys/, 731, 10, 370, 0.559888578, 0.999999995, 0.698198195, 2, 2, 0, 0, 4.359459459
The target value is the number of shares in the social media which ranges from 1 to 800K so, instead of using the raw values I
used standard normalization to fit all of the shares into a smaller range which goes from
-0.29194 to 72.23. Hence, a demarcation has
been created where articles that have standard shares
below -0.225 are unpopular and above 0.1 are popular and those in between are neutral.
Initially, the data had a huge number of neutral dataset which caused the classifiers to over-fit for the
neutral class. Hence, I did a random removal of some of the neutral data for training so that all the classes
were equally represented in the training set.
This caused the accuracy of the classifiers to increase by more than 15%.
Out, of the 15K data 60% data has been used for training, 20% has been set out for testing and the other 20% has been
set as a holdout set which is used for 1-fold cross validation.
The dataset has been dummy coded with values of 0 for Unpopular,
1 for Neutral and 2 for popular.
Decision Tree Classifier
DecisionTreeClassifier from Scikit-learn was used to train and do a three way classification.
Initially, the tree was expanded to fit all the data which created a tree of height of 31 and a baseline
accuracy of 63.9%. After that the tree was pruned iteratively from 33 to 2 to figure out which gave the
best possible accuracy which took about 10 sec. The tree of height 6 gave the best accuracy score on the holdout set of 72%.
Max height of 6 was used to create the final classifier and used for testing and cross validation. From the cross validation graph we can see that as the number of data is increased accuracy on training data decreases whereas accuracy on testing set increases.
It suggests that as more data is included the accuracy of the classifier should increase.
Adaboost classifier from Scikit-learn was used for ensemble learning to boost the
DecisionTreeClassifier. There are two major hyper-parameters to modify depth of the decision tree and number of trees used. I have plotted the depth vs. time and n_estimators vs time complexity graph. It seems that they are directly proportional hence, increasing the depth
and number of estimators takes a lot longer to fit and predict but, also gives better prediction.
Hence, a tree with depth of 7 and 10,000 estimators were used to train the AdaBoost classifier. It took about 30 mins to run and predict. It gave an accuracy of 75% on the testing set which is better than all the other algorithms.
The following graph shows the learning curve for the boosted classifier and the accuracy metrics:
As we can see on the graph above, the classifier has been over-fit with the training data hence very high accuracy on the training set. But, the testing set accuracy has also increased
by a little which suggests that feeding more data to train would increase the accuracy.
MultilevelPerceptron from Weka was used to run the neural network with number of hidden
nodes as 30 using backpropagation to find the weights. The number of hidden layer was chosen based on the number of input attributes which is 57. This classifier
was run for various epoch which gave different accuracy rate which has been plotted below:
From the above complexity curve we can see that increasing the number of epochs gives better accuracy but after 400 epochs accuracy decreases. Hence, 400 epoch were used to generate the cross validation learning curve for the neural network. The model was
used to find a cross-validation error which gave an accuracy of 68% using all the training data.
We can see from the learning curve that as number of training samples is increased the training
accuracy increases which suggests the model is being over fitted
Support Vector Machine
SVC module from Scikit-learn was used to predict the three way classification of popularity
for online articles. I have compared linear and polynomial kernel by switching them out. I have compared penalty
or ‘C’ factor for linear kernels and different degree polynomials for the polynomial kernel.
We can see that for linear kernel Penalty ‘C’ of 1.5 gives the best accuracy of 72.5 % on the
holdout set. Similarly, for the polynomial kernel the best accuracy of 73.6% is given by the polynomial of degree 3.
This information is used to create the best classifier and do a cross validation and testing:
Looking at the graph above and the metrics we can safely conclude that linear kernel is better by few percentage points and could be
improved by adding more data. The learning and testing score both increase when adding more data
By running the training set, on a default KNeighborsClassifier with the default nearest neighbors of 5 it gave an accuracy of 66% and the dataset with 25 best features also gave an accuracy of 66% on the holdout set. To figure out the optimum number of neighbors the number of neighbors were iteratively increased and the accuracy of the hold-out set was calculated. 43 nearest neighbors gave the highest accuracy for the holdout set with all 59 attributes included whereas 12 nearest neighbor gave the
best result for dataset with only 25 attributes included which were 68.5% and 68.3% respectively.
The following chart show the accuracy against the number of neighbors for both the sets with all the attributes and the
cross validation learning curves.
We can see that holding out attributes does not make a difference in the accuracy of the prediction and the training time is also only
different by 10 sec. Hence, using all the features is better.
Popularity prediction of online news article can be done with 75% accuracy based on the ensemble boosting method. However, it takes a lot of time to train and test. Increasing the number of estimators increases the accuracy of the classifier but, by a very small margin as can be seen by going from 5000 estimators to
10,000 increased the accuracy by 1% whereas the training time increased by a whole lot.
It can also be seen that Decision Trees give similar results but, with very little training time. Decision trees correctly classified the tweets 73% of the time with very little overhead. Using only the best 25 features also increased the testing time because decision tree also uses information gain for feature selection.
Neural network did not perform on par with other algorithms. It actually performed worse than other classifiers. It could be because there are some noise in the data which is causing it to misrepresent some of the weights. There are some articles on this dataset that are hyper popular which could be causing the weights to be non-optimal.
Linear SVM also performed on par with the Decision Tree algorithm and it was very fast. It gave an accuracy of 73% on the training set which could mean that the data is linearly separable. KNN gave an average score of 69% with all the features and with only 25 features however, it performed the same meaning the data points that are in similar distance did not change based on the number of features.
Caruana, Rich, and Alexandru Niculescu-Mizil. "Data mining in metric space: an empirical analysis of supervised learning performance criteria." Proceedings of the tenth ACM SIGKDD international conference on Knowledge
discovery and data mining. ACM, 2004.
Hearn, Alison. "Structuring feeling: Web 2.0, online ranking and rating, and the digital ‘reputation’economy." ephemera 10.3/4 (2010): 421-438.
Tatar, Alexandru, et al. "From popularity prediction to ranking online news."Social Network Analysis and Mining 4.1 (2014): 1-12.
Online News Popularity. (n.d.). Retrieved September 21, 2015. http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity
Check out the code