The algorithm detects fake news based on the style


Dr Eng. Piotr Przybyła from the Polish Academy of Sciences is working on an algorithm that uses the stylistic features of a news text to detect whether it is false or manipulated. In this way, his team wants to detect not only fake news but also bots on social media.

Algorithms that detect manipulated or harmful content are not new. They are used, for example, by social media such as Facebook and Twitter. However, large corporations are reluctant to share information on how they work.


"There is no transparency on their part in this matter," says Dr Eng. Piotr Przybyła from the Institute of Computer Science of the Polish Academy of Sciences. The algorithm built by the team under his leadership is largely innovative, because until now scientists have focused on analyzing the truthfulness of the facts given in the content. Przybyła says that it is worth looking at the style of texts made available online, in the form of news articles and posts on social media.

"We want to see how effective document credibility assessment can be when it is based on purely stylistic features," he added.

He emphasizes that his goal is to create an algorithm that detects not only fake news (the most glaring example of manipulated content) but also other propaganda techniques and bots.

How was the algorithm created? First, the team collected a large database of English-language texts (about 100,000), which come from, among other sources, fact-checking organizations. At the same time, the algorithm was told which features to use to distinguish between reliable and unreliable texts.

"Our machine learning model kind of learns by itself - we give it input data with specific labelling and the features that describe that data. Then it is up to the algorithm to decide how to link features with reliability," he describes.
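The idea described in the quote, labelled examples plus features, with the model left to work out the link, can be illustrated with a minimal sketch. It assumes a simple logistic regression trained by gradient descent; the article does not say which model the team uses, and the two features and all data values below are hypothetical.

```python
import math

# Hypothetical toy data: each row describes one text with two made-up features
# (share of emotional words, cited sources per 100 words).
# Labels: 1 = marked reliable by annotators, 0 = marked unreliable.
FEATURES = [
    [0.30, 0.0],
    [0.25, 0.1],
    [0.05, 0.8],
    [0.02, 1.0],
]
LABELS = [0, 0, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Return the estimated probability that a text is reliable."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def train(features, labels, epochs=2000, lr=0.5):
    """Fit weights with plain stochastic gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            error = predict(weights, bias, x) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# The researchers supply only data, labels and features;
# the weights linking features to reliability are learned.
weights, bias = train(FEATURES, LABELS)
```

Calling `predict(weights, bias, [0.03, 0.9])` then gives the model's estimate for a new, unseen text described by the same two features.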

The scientist points to the control of this process as the greatest difficulty. "It may be that, despite our efforts, the algorithm will be guided by premises that we would prefer it not to be guided by" - he adds.


He notes that, for example, information from the BBC was identified as authoritative. "But we would not like our algorithm to consider as true only news that is written in the style of this particular British broadcaster," he says.

Dr Przybyła points out that many unreliable texts in the English-language media concern political polarization in the USA. Many of them feature the names of Presidents Donald Trump and Barack Obama. Therefore, for the algorithm to work better and not be "biased" towards such words, Przybyła removed them from the texts passed to the algorithm. He hopes that in this way the data submitted for further analysis will be more objective: the algorithm will receive information that, for example, a sentence consists of an adjective, noun, adverb and verb, and will thus be blind to the information that the researchers want to filter out because it disrupts the algorithm's work.
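A rough sketch of this preprocessing step might look as follows. The name list and the tiny part-of-speech lexicon are hypothetical stand-ins; a real pipeline would presumably rely on a proper tagger, and the article does not describe the team's actual implementation.

```python
import re

# Hypothetical list of names to hide from the model.
MASKED_NAMES = {"donald", "trump", "barack", "obama"}

# Hypothetical toy lexicon mapping words to grammatical categories.
TOY_POS = {
    "shocking": "ADJ",
    "speech": "NOUN",
    "openly": "ADV",
    "attacked": "VERB",
    "the": "DET",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def remove_names(tokens):
    """Drop tokens that the model should not be able to key on."""
    return [t for t in tokens if t not in MASKED_NAMES]


def to_pos_sequence(tokens):
    """Replace each word with its grammatical category."""
    return [TOY_POS.get(t, "OTHER") for t in tokens]


sentence = "Trump openly attacked the shocking Obama speech"
tokens = remove_names(tokenize(sentence))
pos_sequence = to_pos_sequence(tokens)
# pos_sequence is ["ADV", "VERB", "DET", "ADJ", "NOUN"]
```

After this step the model sees only the grammatical shape of the sentence, not who it is about.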

The researchers themselves imposed the categories of words on the algorithm to make it easier to control. Three main stylistic categories characteristic of unreliable information were observed. The first consists of words that express judgment and concern values and moral goals. The second consists of words that describe power, respect, and influence. The third group consists of strongly emotional words, both positive and negative.

In turn, reliable texts cite other sources and present numerous data.

"Of course, this is a great simplification, because we distinguished over 900 features that guide our algorithm" - he adds.
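As an illustration only, a handful of such features could be computed as below. The word lists, citation cues and feature names are hypothetical; the actual system is said to use over 900 features, none of which are specified in the article.

```python
import re

# Hypothetical mini word lists standing in for the three categories.
CATEGORY_WORDS = {
    "judgment": {"wrong", "evil", "should", "moral", "shameful"},
    "power": {"power", "control", "respect", "influence", "elite"},
    "emotion": {"outrage", "terrible", "amazing", "fear", "love"},
}

# Hypothetical phrases that signal a cited source.
CITATION_CUES = ("according to", "reported by", "said in a statement")


def extract_features(text):
    """Turn a text into a small dictionary of stylistic features."""
    lowered = text.lower()
    tokens = re.findall(r"[a-z']+", lowered)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    # Share of tokens that fall into each word category.
    features = {
        name: sum(t in words for t in tokens) / total
        for name, words in CATEGORY_WORDS.items()
    }
    # Raw counts of citation cues and numeric values.
    features["citations"] = sum(lowered.count(cue) for cue in CITATION_CUES)
    features["numbers"] = len(re.findall(r"\d+(?:\.\d+)?", lowered))
    return features


reliable = extract_features("According to the ministry, 42 of 100 schools reopened.")
unreliable = extract_features("The evil elite should fear our outrage!")
```

In this toy case the first text scores on citations and numbers only, while the second scores on all three word categories and on neither citations nor numbers.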


Przybyła focused on testing the method on English because it is known to all researchers working in this field. "It is also easier to access a large amount of well-prepared and verified data, which streamlines our work" - he notes. Only then - when the assumptions of the model turn out to be correct - will it be possible to create an analogous algorithm for other languages, including Polish.

The algorithm already achieves an effectiveness of 80-90 per cent, but this does not satisfy the researcher, so work on improving it continues. The next stage of the project will be testing it with internet users. The scientists want to check how it affects people's perception of the credibility of content.

According to Dr Przybyła, it is not worth combining this algorithm with others to create a "super-algorithm". "The user must know on what basis the machine makes decisions. It must be transparent. If it is not, then he may not trust it" - he emphasizes.

Przybyła is against letting the algorithm act automatically, for example by cutting users off from content it considers false. Such a decision should be made by the person themselves, he emphasizes.

Dr Przybyła's project, HOMADOS, is financed by the Polish National Agency for Academic Exchange under the "Polish Returns" programme. (PAP)

author: Szymon Zdzierałowski
