The problem that the neural network is trying to solve is to classify Mashable articles into one of popular, neutral, or unpopular groups. This is the same problem as discussed in supervised learning of Mashable articles.

Here, I am using three different random optimization algorithms to try and find a good weight for the neural network and see how each one performs in classifying the Mashable articles. All of the experiments are run through a JAVA package called ABAGAIL. I forked the code into my repository and ran the following builtin optimization algorithms:Initially, the number of iterations were changed on the various optimization algorithms while keeping the data set the same. The iterations were changed from 200 all the way to 6000 in order to see if the accuracy score of the neural network would improve. We can see the accuracy score of each of the optimizations below

### Back Propagation

MultilevelPerceptron from Weka was used to run the neural network with number of hidden nodes as 30 using backpropagation to find the weights. The number of hidden layer was chosen based on the number of input attributes which is 57. This classifier was run for various epoch which gave different accuracy rate which has been plotted below:

### Random Hill Climbing

We can see from the graphs below that the accuracy of the Neural Network did not increase linearly with the increase in the number of iterations. Another interesting thing to notice is that the lower number of iterations gave better test score than more number of iterations. Random Hill Optimization with less than 1000 iterations gave an accuracy score of 57% whereas increasing the number of iterations did not increase the accuracy of the network.

We can conclude that the increasing the number of iterations does not increase the testing accuracy however, the time it takes to run each simulation increases significantly with the number of iterations. Therefore, by keeping the number of iterations constant and changing the number of training data the learning curves were obtained.

With random hill climbing methods, as the number of sample size is increased both the training and testing error decreases significantly. But, with increasing sample size we can see that the testing accuracy increases which means that the random optimization is able to find good weights. However, the weights for neural network are not as optimal as given by the back propagation algorithm. This could have been improved by getting the average score based on many runs instead of just using one single run. The drawback for this is that it is computationally expensive and takes a lot of time.

### Simulated Annealing

Simulated annealing optimization gives much better accuracy than the other two algorithms. The testing accuracy increases with the number of sample size as the bias of classifier decreases. It seems that providing even more data would have increased the accuracy much further. It could be explained by the fact that instead of only doing hill climbing it also does some hill descending. It gave the best accuracy of about 57.5%. It is better than other three algorithms but, not as good as the back propagation algorithm to train the neural network. Similarly, for Simulated Annealing optimizations the testing accuracy does not increase with the increase in number of iterations.

### Genetic Algorithm

Genetic algorithm does not show much variation in the testing accuracy with the number of sample size. However, we can see that the training accuracy decreases much faster than the testing accuracy which seems that the network trained with the genetic algorithm is not learning the training set itself but, figuring out the data set. We can see that the training accuracy decreases with the sample size which means the data was over-fit and now it’s an under-fit. However, even with the genetic algorithm the accuracy is only about 57% whereas with the back propagation algorithm the accuracy was about 73%. Interestingly, for genetic algorithm the testing accuracy actually decreases with the increase in accuracy.

### Conclusion

The poor performance of the optimization algorithms could be explained by stochastic errors where the starting point of each algorithm is randomly selected. Other factors could have been the presence of local optima where the algorithms are getting stuck. Each of the algorithms seem to be fixed on the accuracy of around 57% which suggests that the weight that have been converged to are the same. The subpar performance of the Genetic Algorithm can be explained by The Fundamental Theorem of Genetic Algorithms which assumes an infinitely large population. In this experiment we had to fix the population size such that the algorithms complete in a finite amount of time and the lack of very large dataset also could have been a hurdle.

### References:

Baluja, Shumeet, and Rich Caruana. "Removing the genetics from the standard genetic algorithm." Machine Learning: Proceedings of the Twelfth International Conference. 1995.

Isbell Jr, Charles L. "Randomized Local Search as Successive Estimation of Probability Densities." A longer tutorial version of the 1997 paper on MIMIC that includes a derivation for MIMIC with trees. Can be accessed at http://www. cc. gatech. edu/∼ isbell/tutorials/mimic-tutorial. pdf.

Chen, Luonan, and Kazuyuki Aihara. "Chaotic simulated annealing by a neural network model with transient chaos." Neural networks 8.6 (1995): 915-930.

Caruana, Rich, and Alexandru Niculescu-Mizil. "Data mining in metric space: an empirical analysis of supervised learning performance criteria." Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2004.

Online News Popularity. (n.d.). Retrieved September 21, 2015. http://archive.ics.uci.edu/ml/datasets/Online+News+Popularity