Revelio

Machine-Learning classifier identifying bots in software projects, integrable with GrimoireLab

MSc Thesis

Notebooks

Slides

Code Repository

Summary

Revelio is the tool I developed for the final project of my Master's degree in Data Science (Rey Juan Carlos University).

The tool uses a Machine-Learning model to detect automated accounts (bots) in software projects, based on their profile information and their activity in the project. It can be integrated as a component of the GrimoireLab toolchain.

With this goal in mind, we used GrimoireLab to analyse the code changes produced between January 2008 and September 2021 in a set of software projects from the Wikimedia Foundation. We manually labelled the bot accounts that generated activity, creating an input dataset to train a binary classifier that predicts whether a given profile is a bot.

After testing different classification models with the Scikit-learn library for Python, the best performer was a “Random Forest” classifier. Its most relevant features were a terms score calculated from domain-related heuristics and statistical values obtained from each individual’s activity, such as the number of source code changes or the number of words and files per code change submitted to the projects.
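To make the feature description more concrete, here is a minimal Python sketch of how such a per-account feature vector could be computed. It is an illustration only, not Revelio's actual code: the term list `BOT_TERMS`, the `Change` record, and the helper names `terms_score` and `build_features` are all hypothetical, and the real heuristics and feature set are documented in the thesis and notebooks.

```python
import re
from dataclasses import dataclass
from statistics import mean

# Hypothetical list of bot-related terms (an assumption for this sketch;
# the domain heuristics actually used by Revelio are not reproduced here).
BOT_TERMS = frozenset({"bot", "jenkins", "automation", "l10n"})


@dataclass
class Change:
    """One code change submitted by an account (illustrative fields only)."""
    words: int  # number of words in the change
    files: int  # number of files touched by the change


def terms_score(name: str, email: str) -> int:
    """Count how many distinct bot-related terms appear in the profile."""
    # Split on anything that is not a letter or digit, so "l10n-bot"
    # yields the tokens "l10n" and "bot".
    tokens = set(re.split(r"[^a-z0-9]+", f"{name} {email}".lower()))
    return len(tokens & BOT_TERMS)


def build_features(name: str, email: str, changes: list) -> dict:
    """Combine the profile-based score with simple activity statistics."""
    n = len(changes)
    return {
        "terms_score": terms_score(name, email),
        "num_changes": n,
        "mean_words_per_change": mean(c.words for c in changes) if n else 0.0,
        "mean_files_per_change": mean(c.files for c in changes) if n else 0.0,
    }


# Example with made-up data for a single account.
example_changes = [Change(words=4, files=1), Change(words=4, files=1), Change(words=6, files=1)]
features = build_features("L10n-bot", "l10n-bot@example.org", example_changes)
```

Dictionaries like `features`, one per manually labelled account, could then be turned into rows of a feature matrix and passed to a classifier such as Scikit-learn's `RandomForestClassifier`.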

To learn more about the data analysis process and the classification experiments, visit the Notebooks section.

General architecture