Abstract
The objective of our project is to predict network traffic by mining news articles. The media slogan 'give people what they want' leads us to assume that articles will always reflect the most popular topics in society at a given time, which makes the relation between mainstream media and online user behavior worth observing. Under the influence of the media, users browse videos related to popular topics. An Internet Service Provider (ISP) can take advantage of this trend and pre-cache highly popular videos to reduce overall traffic and improve the customer experience. Topic modeling has been an active research area for the past few years. We use latent Dirichlet allocation (LDA) to evaluate topic popularity and a frequent pattern mining algorithm to form the topic titles. After discovering popular topics, we can download related videos and pre-cache them at strategic nodes. The experimental results show that our proactive pre-caching method reduces overall delay more than conventional caching methods.
Introduction
A large share of today's Internet traffic comes from social video streams. Internet Service Providers can significantly improve local traffic conditions by applying proactive caching methods that predict which videos will become popular. The media slogan "give people what they want" leads us to assume that mainstream news outlets will always publish articles related to the topics popular in society. Given this trend, it is interesting to observe the relation between mainstream media and online user behavior. Under the influence of news, users browse videos related to popular topics. The purpose of our study is to identify popular topics in news articles and pre-cache related videos at strategic nodes to reduce overall traffic.
Method
Topic modeling has been an active research area for the past few years, and a number of tools are available online for classifying and clustering the topics in a document set. We use latent Dirichlet allocation (LDA) to evaluate topic popularity. To form a topic title, we select only the article titles classified as belonging to that topic and apply a frequent pattern mining algorithm (Apriori) to them to detect frequent 2-itemsets and 3-itemsets.
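The title-forming step can be illustrated with a minimal sketch. It assumes that each article title is treated as a set of lowercase words, that a small stopword list is removed first, and that the largest, best-supported frequent itemset is taken as the topic title. The function names (`tokenize`, `apriori`, `topic_title`), the stopword list, the support threshold, and the sample titles are all hypothetical and are not taken from the project's code.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "as", "at", "is"}


def tokenize(title):
    """Lowercase a title and return its set of non-stopword tokens."""
    words = (w.strip(".,:;!?'\"()") for w in title.lower().split())
    return frozenset(w for w in words if w and w not in STOPWORDS)


def apriori(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets of up to max_size words
    that occur in at least min_support transactions."""
    counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    frequent = dict(current)
    size = 2
    while current and size <= max_size:
        items = sorted(set().union(*current))
        # Apriori pruning: keep a candidate only if every (size-1)-subset is frequent.
        candidates = [
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(s) in current for s in combinations(c, size - 1))
        ]
        current = {}
        for cand in candidates:
            support = sum(1 for t in transactions if cand <= t)
            if support >= min_support:
                current[cand] = support
        frequent.update(current)
        size += 1
    return frequent


def topic_title(titles, min_support):
    """Pick the largest, best-supported frequent 2- or 3-itemset as the title words."""
    frequent = apriori([tokenize(t) for t in titles], min_support)
    multi = {s: c for s, c in frequent.items() if len(s) >= 2}
    if not multi:
        return []
    best = max(multi, key=lambda s: (len(s), multi[s], sorted(s)))
    return sorted(best)


# Hypothetical titles that LDA assigned to one topic.
titles = [
    "Hurricane Sandy hits New York",
    "New York braces for Hurricane Sandy",
    "Hurricane Sandy floods subway",
    "Sandy: hurricane damage estimates rise",
]
print(topic_title(titles, min_support=3))
```

The support threshold controls how specific the title is: a lower threshold admits longer itemsets that appear in fewer titles, while a higher one keeps only the words shared by most titles in the topic.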
Implementation
Several topic classification tools can be found on the web; we downloaded the Online LDA tool, which is implemented in Python. The original tool downloads random Wikipedia articles and classifies their topics. We modified the source code to download articles from specified sources instead. The list of sources is composed of the major news agencies. Each article is parsed before being handed to the Online LDA tool. The final output consists of 100 topics with 53 words per topic. Words are sorted based on their appearance in the news articles. Finally, the topics are sorted by popularity, and we query videos related to the most popular ones.
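The final ranking step can be sketched as follows. The text does not define how popularity is measured, so this sketch assumes one plausible measure: a topic's popularity is the number of articles for which it is the dominant topic. The per-article topic weights, the keyword lists, and the helper names (`rank_topics`, `top_queries`) are hypothetical stand-ins for the output of the modified Online LDA tool.

```python
from collections import Counter


def rank_topics(doc_topic_weights):
    """Rank topic ids by the number of articles whose dominant topic they are.

    doc_topic_weights holds one list of per-topic weights for each article.
    Returns (topic_id, article_count) pairs, most popular first.
    """
    dominant = [
        max(range(len(weights)), key=weights.__getitem__)
        for weights in doc_topic_weights
    ]
    counts = Counter(dominant)
    # Sort by descending count; break ties by topic id for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def top_queries(ranked, topic_keywords, n):
    """Build one video search query string for each of the n most popular topics."""
    return [" ".join(topic_keywords[topic]) for topic, _ in ranked[:n]]


# Hypothetical weights for five articles over three topics.
weights = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
# Hypothetical topic titles, e.g. as formed by frequent pattern mining.
keywords = {
    0: ["hurricane", "sandy"],
    1: ["election", "debate"],
    2: ["champions", "league"],
}

ranked = rank_topics(weights)
print(top_queries(ranked, keywords, n=2))
```

The resulting query strings would then be passed to a video search service, and the videos returned would be the candidates for pre-caching.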
Experimental Results
Online LDA alone correctly chooses the most popular topic around 57% of the time using 1k articles and around 92% of the time using 100k articles. Combined with frequent pattern mining, the accuracy is around 93% with 1k articles and close to 100% with 100k articles. When using only Online LDA with 10k articles, there is only around a 60% chance that the selected video will be relevant to the actual topic; with 100k articles the probability rises to about 87%. When using frequent pattern mining together with Online LDA, there is around a 94% chance that the selected video is relevant using 10k articles, and with 100k articles the probability is close to 100%.
Conclusion
The objective of our project is to predict network traffic by mining news articles. The media slogan "give people what they want" leads us to assume that articles will always reflect the most popular topics in society at a given time. From these results we conclude that LDA combined with frequent pattern mining can predict which videos will generate the most traffic. Experimental results show that our proactive caching method achieves much better performance in terms of reducing delay than conventional caching methods.