AI student competition 2022: Fake news detection

Detecting fake news in Hungarian news articles and tweets

This year the goal of the annual student competition in Artificial Intelligence at the Faculty of Electrical Engineering and Informatics was to detect fake news entries in short articles written in Hungarian.

Organizer: Artificial Intelligence Research Group, Department of Measurement and Information Systems.

Date and venue: 23rd of November, 2022. Online (Teams).

Data

To narrow the focus of the challenge, we collected documents related to the Covid pandemic. Assembling a sufficiently large, validated Hungarian-language data set that allows training various kinds of statistical and machine learning models proved rather hard. As machine translation has improved vastly over the past decade, we decided to collect validated English-language texts and translate them into Hungarian.

The core corpus was created from two validated English-language fake news data sets, translated with DeepL, which performs surprisingly well in English-to-Hungarian translation. We then extended this corpus with texts from selected Hungarian newspapers and online sources.

The resulting corpus contains roughly 10k documents (half of them labelled as fake news). The average length of the documents is around 630 characters.

Task

The task was to develop an offline system that detects fake news in Hungarian-language texts and maximizes a given set of performance metrics.
Python was the preferred programming environment, together with the usual data science, machine learning and NLP libraries.
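
As an illustration of such a metric, the area under the precision-recall curve (AUPRC), the figure quoted in the results, is commonly estimated as average precision. The sketch below is a minimal pure-Python version under stated assumptions: labels are 1 for fake news and 0 otherwise, higher scores mean more suspicious documents, and no two scores are tied. The function name and the toy data are hypothetical, and the competition's actual evaluation code is not shown here.

```python
def average_precision(y_true, scores):
    """Average precision: a step-wise estimate of the area under the PR curve.

    y_true: sequence of 0/1 labels (1 = fake news, the positive class).
    scores: sequence of classifier scores, higher = more likely fake.
    Assumes scores are not tied; ties would make the ranking order arbitrary.
    """
    # Rank documents from most to least suspicious.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(label for _, label in ranked)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive labels")

    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at this rank; each positive adds 1 / n_pos of recall.
            total += hits / rank
    return total / n_pos


# Toy labels and scores, invented for illustration only.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
ap = average_precision(labels, scores)
print(f"AUPRC (average precision): {ap:.3f}")
```

Unlike plain accuracy, this value depends on how well the scores rank fake items ahead of real ones, not on a single decision threshold.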

Results

The best performer achieved 99% on the AUPRC metric (area under the precision-recall curve) by fine-tuning a pre-trained Hungarian BERT model.
Simple bag-of-words models with SVM or PassiveAggressive classifiers, as well as spaCy models pre-trained on Hungarian texts combined with an SVM, scored around 95%, provided that proper pre-processing (stop-word filtering, stemming) was applied.
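
The bag-of-words baseline idea can be sketched roughly as follows. This is a from-scratch illustration, not any participant's code: it assumes the PA-I variant of the passive-aggressive update, labels of +1 (fake) and -1 (real), and a tiny hand-picked stop-word list. Stemming is omitted for brevity. All function names and the four example sentences are invented placeholders, not corpus data.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the competition's list).
STOP_WORDS = {"a", "az", "és", "hogy", "is"}


def bag_of_words(text):
    """Lowercase, split into word tokens, drop stop words, count the rest."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def score(weights, features):
    """Dot product of a sparse weight dict and a sparse feature Counter."""
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in features.items())


def train_passive_aggressive(docs, labels, epochs=5, c=1.0):
    """Train a binary PA-I classifier; labels are +1 (fake) or -1 (real)."""
    weights = {}
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            x = bag_of_words(text)
            sq_norm = sum(cnt * cnt for cnt in x.values())
            if sq_norm == 0:
                continue  # nothing left after stop-word filtering
            loss = max(0.0, 1.0 - y * score(weights, x))  # hinge loss
            if loss > 0.0:
                tau = min(c, loss / sq_norm)  # PA-I step size, capped at c
                for tok, cnt in x.items():
                    weights[tok] = weights.get(tok, 0.0) + tau * y * cnt
    return weights


def predict(weights, text):
    """Return +1 (fake) or -1 (real); a score of exactly 0 counts as +1."""
    return 1 if score(weights, bag_of_words(text)) >= 0 else -1


# Invented toy sentences, for illustration only.
docs = [
    "A vakcina mikrochipet tartalmaz",
    "A fokhagyma gyógyítja a fertőzést",
    "A kórház új oltópontot nyitott",
    "Az orvosok a kézmosást ajánlják",
]
labels = [1, 1, -1, -1]

weights = train_passive_aggressive(docs, labels)
predictions = [predict(weights, d) for d in docs]
print(predictions)
```

In practice a library implementation (for example scikit-learn's `CountVectorizer` combined with `LinearSVC`) would replace the hand-written parts, and a Hungarian stemmer would be applied before counting.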