Deployed Heroku app: https://flaireddit.herokuapp.com/
flaiReddit is a Reddit Flair Detector for the subreddit r/india: it takes any post's URL as input and predicts the post's flair using machine learning models. The web application is hosted on Heroku at flaiReddit, and it also displays several useful plots obtained from analysis of the collected data.
The code has been developed in Python, using its text-processing and machine-learning libraries. The web application has been built with Flask, HTML and CSS, and is hosted on Heroku.
The dependencies can be found in requirements.txt.
- app.py: Used to start up the Flask app (a minimal sketch of how it might look follows this list).
- scrapeData.py: Used to scrape r/India posts from Reddit.
- training_models.py: Used to pre-process text and train various models. It was also used to analyse data by plotting trends.
- helper.py: Used to get the predicted flair for a given post URL.
- requirements.txt: Contains all dependencies for the project.
- nltk.txt: Contains NLTK library dependencies for deployment on Heroku.
- data: Contains CSV and JSON files of collected posts.
- templates: Contains the HTML templates for the web application.
- static: Contains the images folder with the plots displayed on the web application, obtained after data analysis.
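A minimal sketch of how app.py might wire helper.py into Flask is shown below; the route, template name, form field and the `predict_flair` helper are all illustrative assumptions, not the repository's actual interface.

```python
# Sketch of the Flask app entry point; names below are hypothetical.
from flask import Flask, render_template, request
from helper import predict_flair   # assumption: helper.py exposes a prediction function

app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def index():
    flair = None
    if request.method == "POST":
        # Predict the flair for the Reddit post URL submitted via the form
        flair = predict_flair(request.form["url"])
    return render_template("index.html", flair=flair)

if __name__ == "__main__":
    app.run(debug=True)
```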
- Open the Terminal.
- Clone the repository by entering `git clone https://github.com/Jap-Leen/Reddit-Flair-Detector.git`.
- Ensure that `Python3` and `pip` are installed on the system.
- Create a `virtualenv` by executing the following command: `virtualenv venv`.
- Activate the `venv` virtual environment by executing the following command: `source venv/bin/activate`.
- Enter the cloned repository directory and execute `pip install -r requirements.txt`.
- Run `python app.py` from the Terminal.
The Python library PRAW has been used to scrape data from the subreddit r/india. 300 posts belonging to each of the flairs were collected and analysed, as sketched below.
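A minimal sketch of this scraping step with PRAW follows; the actual logic lives in scrapeData.py, and the credentials, flair list and output path here are placeholder assumptions.

```python
# Sketch of scraping r/india posts by flair with PRAW (actual script: scrapeData.py).
import praw
import pandas as pd

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # assumption: supply your own Reddit API credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="flaiReddit scraper",
)

flairs = ["Politics", "Non-Political", "AskIndia"]   # illustrative subset of flairs
posts = []
for flair in flairs:
    # Collect up to 300 posts per flair, matching the counts described above
    for submission in reddit.subreddit("india").search(f'flair_name:"{flair}"', limit=300):
        submission.comments.replace_more(limit=0)    # flatten the comment tree
        comments = " ".join(c.body for c in submission.comments.list())
        posts.append({
            "flair": flair,
            "title": submission.title,
            "body": submission.selftext,
            "url": submission.url,
            "comments": comments,
        })

pd.DataFrame(posts).to_csv("data/posts.csv", index=False)   # assumed output path
```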
The following pre-processing steps have been applied to the title, body and comments to clean the data:
- Lowercasing
- Tokenizing and stemming
- Lemmatization
- Removing stopwords
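A minimal sketch of these cleaning steps with NLTK might look as follows; the function name `clean_text` is illustrative, and the real code in training_models.py may apply stemming and lemmatization in separate experiments.

```python
# Sketch of the cleaning steps listed above, using NLTK.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    tokens = word_tokenize(text.lower())                  # lowercasing and tokenizing
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation and numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]   # removing stopwords
    tokens = [lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens]  # stemming + lemmatization
    return " ".join(tokens)

print(clean_text("Asking about the upcoming elections in India!"))
```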
The collected data is stored as a MongoDB collection; its JSON file can be found here.
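Storing the scraped posts could look roughly like this with pymongo, assuming a local MongoDB server; the database and collection names are assumptions.

```python
# Sketch of inserting the scraped posts into a MongoDB collection.
import pandas as pd
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")   # assumption: local MongoDB instance
collection = client["flaireddit"]["posts"]           # assumed database/collection names

records = pd.read_csv("data/posts.csv").to_dict("records")
collection.insert_many(records)                      # one document per scraped post
```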
The collected data is split as follows:
25% as Test Data and 75% as Training Data
Post features such as Title, Comments, Body and URL are combined in various ways and used to train three algorithms: Multinomial Naive Bayes, Linear SVM and Logistic Regression.
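A sketch of this split-and-train loop with scikit-learn pipelines, assuming the CSV columns from the scraping sketch above; the CountVectorizer + TF-IDF feature extraction is an assumption, not necessarily the exact setup in training_models.py.

```python
# Sketch of comparing the three algorithms on the combined text features.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("data/posts.csv")
# Combine Title, Body, Comments and URL into a single text feature,
# as in the best-performing combination reported below.
X = (df["title"] + " " + df["body"].fillna("") + " "
     + df["comments"].fillna("") + " " + df["url"])
y = df["flair"]

# 75% training data, 25% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, clf in models.items():
    pipe = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", clf)])
    pipe.fit(X_train, y_train)
    print(name, pipe.score(X_test, y_test))   # accuracy on the held-out 25%
```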
The model with the highest accuracy score is saved, then loaded to predict the flair, and the returned result is displayed on the web application.
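Saving and reloading the winning model could be done with joblib, roughly as below; the file name and placeholder training data are assumptions, and helper.py performs the actual URL-to-flair prediction.

```python
# Sketch of persisting the best pipeline and reloading it at prediction time.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

best_pipe = Pipeline([("vect", CountVectorizer()),
                      ("tfidf", TfidfTransformer()),
                      ("clf", LinearSVC())])
# Placeholder training data; in practice this is the winning pipeline from the comparison above.
best_pipe.fit(["budget tax policy debate", "which city should i move to"],
              ["Politics", "AskIndia"])
joblib.dump(best_pipe, "flair_model.pkl")   # assumed file name

model = joblib.load("flair_model.pkl")
print(model.predict(["combined title body comments and url of a new post"])[0])
```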
The resulting scores for different stages of pre-processing, features and models can be found above.
The best accuracy score obtained was 0.793248945147679, with the combination of Title, Body, Comments and URL as features and Linear SVM as the model (using simple pre-processing, without stemming and lemmatization).