“Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.” - John F. Woods

    Popularity of Mashable Articles


    For my Intro to Machine Learning class, I had to devise a problem on which to run supervised learning algorithms and see how each one fared. From the UCI Machine Learning Repository I chose the Online News Popularity dataset. The problem is interesting because it has a lot of practical uses and is challenging at the same time.

    A lot of variables affect which articles become popular and which do not. The content and style of an article, along with the sociocultural and political atmosphere at the time, affect its popularity. Popularity itself can be defined in various ways: how long the article remains relevant, how many people read it, how often it is shared on social media, or how many comments it generates compared to other articles. For this dataset, the collectors chose popularity to mean how many shares an article generated on social media. It is a good quantitative metric, and it was collected over two years.

    The following are some of the attributes of the dataset:

    url, timedelta, n_tokens_title, n_tokens_content, n_unique_tokens, n_non_stop_words, n_non_stop_unique_tokens, num_hrefs, num_self_hrefs, num_imgs, num_videos, average_token_length, num_keywords, data_channel_is_lifestyle

    Two examples from the dataset:

    731, 9, 531, 0.503787878, 0.999999997, 0.665634673, 9, 0, 1, 0, 4.404896422
    731, 10, 370, 0.559888578, 0.999999995, 0.698198195, 2, 2, 0, 0, 4.359459459

    The target value is the number of shares on social media, which ranges from 1 to 800K. Instead of using the raw values, I applied standard normalization to fit all of the shares into a smaller range, from -0.29194 to 72.23. A demarcation was then made: articles with standardized shares below -0.225 are unpopular, those above 0.1 are popular, and those in between are neutral.
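    The standardize-then-bucket step above can be sketched as follows. This is a minimal illustration, not the original preprocessing script; the thresholds (-0.225 and 0.1) come from the write-up, and the toy share counts are made up.

```python
import numpy as np

def label_popularity(shares, low=-0.225, high=0.1):
    """Standard-normalize share counts, then bucket them into three classes:
    0 = unpopular (z < low), 1 = neutral, 2 = popular (z > high)."""
    z = (shares - shares.mean()) / shares.std()
    labels = np.full(shares.shape, 1)   # neutral by default
    labels[z < low] = 0                 # unpopular
    labels[z > high] = 2                # popular
    return z, labels

# Toy, heavily skewed share counts (the real data ranges from 1 to 800K)
shares = np.array([10, 50, 100, 200, 500, 1000, 5000, 800_000], dtype=float)
z, labels = label_popularity(shares)
```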

    Initially, the data had a huge number of neutral examples, which caused the classifiers to over-fit to the neutral class. Hence, I randomly removed some of the neutral data from training so that all classes were roughly equally represented in the training set. This increased the accuracy of the classifiers by more than 15%.
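    The random removal described above is a standard undersampling step; a minimal sketch (with a made-up toy dataset, not the original code) looks like this:

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop examples from over-represented classes so every
    class keeps as many examples as the rarest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    rng.shuffle(keep)
    return X[keep], y[keep]

# Imbalanced toy set: 6 neutral (1), 2 unpopular (0), 2 popular (2)
X = np.arange(10).reshape(-1, 1)
y = np.array([1, 1, 1, 1, 1, 1, 0, 0, 2, 2])
Xb, yb = undersample(X, y)
```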

                 Unpopular  Neutral  Popular  Total  No. of Features
    Training        3303      2756     3438    9497        59
    Testing         1166       873     1127    3166        59
    Holdout Set     1148       870     1148    3166        59

    Out of the ~15K examples, 60% were used for training, 20% for testing, and the remaining 20% as a holdout set used for validation. The target has been label-encoded with values of 0 for Unpopular, 1 for Neutral, and 2 for Popular.
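    The 60/20/20 split can be sketched with two calls to Scikit-learn's train_test_split; the random synthetic data here just stands in for the real feature matrix:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 examples, 5 features, labels 0/1/2
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.integers(0, 3, size=100)

# First carve off 40%, then split that 40% evenly into test and holdout
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_test, X_hold, y_test, y_hold = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)
```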

    Decision Tree Classifier

    DecisionTreeClassifier from Scikit-learn was used to train a three-way classifier. Initially, the tree was expanded to fit all the data, which produced a tree of height 31 and a baseline accuracy of 63.9%. The maximum height was then swept iteratively from 33 down to 2 to find the value that gave the best possible accuracy, which took about 10 seconds. The tree of height 6 gave the best accuracy on the holdout set, 72%.
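    The depth sweep above can be sketched as follows; a synthetic three-class dataset stands in for the news data, and the best depth found here will not match the write-up's:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the news data: 3 classes, 20 features
X, y = make_classification(n_samples=600, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Sweep max_depth and keep the depth that scores best on the holdout set
best_depth, best_score = None, -1.0
for depth in range(2, 34):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_train, y_train)
    score = clf.score(X_hold, y_hold)
    if score > best_score:
        best_depth, best_score = depth, score
```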

    [Figures: decision tree complexity curve and accuracy metrics]

    A max height of 6 was used to create the final classifier for testing and cross validation. From the cross-validation graph we can see that as the amount of training data increases, accuracy on the training data decreases while accuracy on the testing set increases. This suggests that including more data should increase the classifier's accuracy.


    Boosting

    The AdaBoost classifier from Scikit-learn was used for ensemble learning to boost the DecisionTreeClassifier. There are two major hyper-parameters to tune: the depth of the decision trees and the number of trees used. I plotted depth vs. time and n_estimators vs. time. They appear to be directly proportional: increasing the depth and the number of estimators takes much longer to fit and predict, but also gives better predictions.

    [Figures: boosting time vs. number of estimators and vs. tree depth]

    Hence, trees of depth 7 and 10,000 estimators were used to train the AdaBoost classifier. It took about 30 minutes to run and predict, and gave an accuracy of 75% on the testing set, better than all the other algorithms. The following graph shows the learning curve for the boosted classifier and the accuracy metrics:

    [Figure: boosting accuracy and learning curve]

    As we can see in the graph above, the classifier has over-fit the training data, hence the very high accuracy on the training set. But the testing-set accuracy has also increased a little, which suggests that feeding in more data for training would increase accuracy.

    Neural Network

    MultilayerPerceptron from Weka was used to run the neural network with 30 hidden nodes, using backpropagation to find the weights. The number of hidden nodes was chosen based on the number of input attributes, which is 57. The classifier was run for various numbers of epochs, which gave different accuracy rates, plotted below:
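    The experiment itself used Weka's MultilayerPerceptron; for readers working in Python, a rough Scikit-learn analogue (one hidden layer of 30 nodes, backpropagation, a fixed epoch cap via max_iter, synthetic stand-in data) would look like:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 57 features to mirror the dataset's input attribute count
X, y = make_classification(n_samples=400, n_features=57, n_classes=3,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 30 nodes, capped at 400 training iterations
mlp = MLPClassifier(hidden_layer_sizes=(30,), max_iter=400, random_state=0)
mlp.fit(X_train, y_train)
accuracy = mlp.score(X_test, y_test)
```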

    [Figures: neural network learning and complexity curves]

    From the complexity curve above we can see that increasing the number of epochs gives better accuracy, but after 400 epochs accuracy decreases. Hence, 400 epochs were used to generate the cross-validation learning curve for the neural network. The model gave a cross-validation accuracy of 68% using all the training data. We can see from the learning curve that as the number of training samples increases, the training accuracy increases, which suggests the model is being over-fitted.

    Support Vector Machine

    The SVC module from Scikit-learn was used for the three-way classification of article popularity. I compared linear and polynomial kernels by switching them out, sweeping the penalty factor 'C' for the linear kernel and the degree for the polynomial kernel.
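    The kernel comparison can be sketched as two small sweeps scored on a holdout set; the parameter grids here are illustrative (the write-up's plots cover its actual ranges), and the data is synthetic:

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Linear kernel: sweep the penalty C
linear_scores = {
    C: SVC(kernel="linear", C=C).fit(X_train, y_train).score(X_hold, y_hold)
    for C in (0.5, 1.0, 1.5, 2.0)
}
# Polynomial kernel: sweep the degree
poly_scores = {
    d: SVC(kernel="poly", degree=d).fit(X_train, y_train).score(X_hold, y_hold)
    for d in (2, 3, 4)
}
```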

    [Figures: SVM linear and polynomial kernel complexity curves]

    We can see that for the linear kernel, a penalty 'C' of 1.5 gives the best accuracy, 72.5%, on the holdout set. Similarly, for the polynomial kernel the best accuracy, 73.6%, comes from a polynomial of degree 3. These settings were used to create the best classifier and do cross validation and testing:

    [Figures: SVM linear and polynomial kernel learning curves]

    Looking at the graph above and the metrics, we can safely conclude that the linear kernel is better by a few percentage points and could be improved by adding more data. Both the training and testing scores increase as more data is added.

    K-Nearest Neighbors

    Running the training set on a default KNeighborsClassifier with the default 5 nearest neighbors gave an accuracy of 66% on the holdout set, as did the dataset reduced to the 25 best features. To find the optimum number of neighbors, the neighbor count was increased iteratively and the holdout accuracy calculated each time. 43 nearest neighbors gave the highest holdout accuracy with all 59 attributes included, while 12 nearest neighbors gave the best result with only 25 attributes: 68.5% and 68.3% respectively.
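    The neighbor sweep can be sketched as follows; on the synthetic stand-in data the best k will of course differ from the 43 and 12 found on the real dataset:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_classes=3,
                           n_informative=6, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)

# Score every neighbor count on the holdout set and keep the best
scores = {}
for k in range(1, 51):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores[k] = knn.score(X_hold, y_hold)
best_k = max(scores, key=scores.get)
```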

    [Figures: KNN complexity curves with 25 features and with all features]

    The following charts show accuracy against the number of neighbors for both feature sets, along with the cross-validation learning curves.

    [Figures: KNN accuracy vs. number of neighbors and cross-validation learning curves]

    We can see that holding out attributes makes no difference in prediction accuracy, and the training time differs by only about 10 seconds. Hence, using all the features is better.


    Conclusion

    Popularity prediction of online news articles can be done with 75% accuracy using the ensemble boosting method. However, it takes a long time to train and test. Increasing the number of estimators increases the classifier's accuracy, but only by a small margin: going from 5,000 to 10,000 estimators increased accuracy by 1% while increasing training time substantially.

    It can also be seen that decision trees give similar results with very little training time: they correctly classified the articles 73% of the time with very little overhead. Using only the best 25 features actually increased testing time, since the decision tree already performs feature selection via information gain.

    The neural network did not perform on par with the other algorithms; it actually performed worse than the other classifiers. This could be because there is noise in the data causing some of the weights to be misrepresented. Some articles in this dataset are hyper-popular, which could be pushing the weights away from optimal.

    The linear SVM performed on par with the decision tree and was very fast. It gave an accuracy of 73% on the training set, which could mean that the data is linearly separable. KNN gave an average score of 69% with all the features; with only 25 features it performed the same, meaning the set of nearby data points did not change with the number of features.


    References

    Caruana, Rich, and Alexandru Niculescu-Mizil. "Data mining in metric space: an empirical analysis of supervised learning performance criteria." Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004.

    Hearn, Alison. "Structuring feeling: Web 2.0, online ranking and rating, and the digital 'reputation' economy." ephemera 10.3/4 (2010): 421-438.

    Tatar, Alexandru, et al. "From popularity prediction to ranking online news." Social Network Analysis and Mining 4.1 (2014): 1-12.

    Online News Popularity Data Set. UCI Machine Learning Repository. (n.d.). Retrieved September 21, 2015.

    Check out the code