Revelio
Machine-learning classifier that identifies bots in software projects, integrable with GrimoireLab
Summary
Revelio is a tool developed for my final project in the Master's degree in Data Science (Rey Juan Carlos University).
It uses a machine-learning model to detect automated accounts (bots) in software projects,
based on their profile information and their activity in the project, and it can be integrated as a component of the GrimoireLab toolchain.
With this goal in mind, we used GrimoireLab to analyse the code changes produced between January 2008
and September 2021 in a set of software projects from the Wikimedia Foundation. We manually labelled
the bot accounts that generated activity, creating an input dataset to train a binary classifier
that detects whether a given profile is a bot or not.
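As a rough illustration of how such a labelled dataset could be assembled, the sketch below groups code changes by author and attaches a manual bot label to each profile. The record fields (`author`, `files`, `words`), the account names, and the `build_dataset` helper are assumptions made for this example; they do not reflect GrimoireLab's actual schema or the project's real pipeline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical code-change records; the field names are illustrative,
# not GrimoireLab's actual schema.
changes = [
    {"author": "jenkins-bot", "files": 1, "words": 4},
    {"author": "jenkins-bot", "files": 1, "words": 4},
    {"author": "jenkins-bot", "files": 2, "words": 6},
    {"author": "alice", "files": 5, "words": 40},
    {"author": "alice", "files": 3, "words": 22},
]

# Manual labels as a reviewer might record them (assumed format).
manual_labels = {"jenkins-bot": True, "alice": False}


def build_dataset(changes, labels):
    """Aggregate changes per author and attach the manual bot label."""
    per_author = defaultdict(list)
    for change in changes:
        per_author[change["author"]].append(change)

    rows = []
    for author, items in sorted(per_author.items()):
        rows.append({
            "author": author,
            "n_changes": len(items),
            "mean_files": mean(c["files"] for c in items),
            "mean_words": mean(c["words"] for c in items),
            "is_bot": labels[author],
        })
    return rows


dataset = build_dataset(changes, manual_labels)
for row in dataset:
    print(row)
```

Each resulting row holds one profile with its activity summary and its label, which is the shape a binary classifier expects as training input.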
After testing different classification models with the scikit-learn library for Python, the
model that performed best was a Random Forest classifier. Its most relevant features
were a terms score calculated from domain-related heuristics and statistical values obtained
from each individual's activity, such as the number of source code changes or the
number of words and files per code change submitted to the projects.
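The following is a minimal sketch of training a Random Forest with scikit-learn and inspecting which features it relies on most. It is not the project's actual training code: the feature names (`terms_score`, `n_changes`, `mean_words`, `mean_files`), the synthetic data, and the hyperparameters are all assumptions chosen for illustration.

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = random.Random(0)

# Hypothetical feature names, loosely following the description above.
FEATURES = ["terms_score", "n_changes", "mean_words", "mean_files"]


def synthetic_profile(is_bot):
    """Return one made-up feature row; the distributions are invented."""
    if is_bot:
        return [rng.uniform(0.6, 1.0), rng.randint(500, 5000),
                rng.uniform(2, 8), rng.uniform(1, 2)]
    return [rng.uniform(0.0, 0.4), rng.randint(1, 800),
            rng.uniform(10, 60), rng.uniform(1, 10)]


labels = [i % 5 == 0 for i in range(200)]  # one in five profiles is a bot
X = [synthetic_profile(is_bot) for is_bot in labels]
y = [int(is_bot) for is_bot in labels]

# Stratify so both splits keep the same bot/human ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
importances = dict(zip(FEATURES, model.feature_importances_))
print(f"accuracy on synthetic data: {accuracy:.2f}")
for name, weight in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.3f}")
```

The synthetic classes are deliberately easy to separate, so the accuracy printed here says nothing about Revelio's performance on real data. The point is only to show how `feature_importances_` can rank features such as a terms score against activity statistics.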
To learn more about the data analysis process and the classification
experiments, visit the Notebooks section.