Συλλογή δεδομένων από WhatsApp και Telegram
Data crawling on WhatsApp and Telegram

Keywords
WhatsApp ; Telegram ; Data crawling ; WhatsApp groups ; Telegram groups ; Telegram channels ; Messaging ; K-means algorithm ; Python ; Clustering ; TF-IDF ; LDA topic modeling
Abstract
Text Message Apps have dramatically changed the way people communicate and interact.
However, the sheer volume of information and the ease of sharing it give rise to serious problems in various
areas of our daily life, given that the content of the messages may involve fake news, illegal
material and other types of illegal or immoral activities. This creates a need for timely and
accurate identification of messages that may contain illegal material, especially in mass media
communication channels. Yet, analysis of the data communicated through these services with the
objective to identify the topic of discussion in the different groups or channels is challenging, as
messages are typically short and may contain multimedia. An equally important challenge is
to deter and prevent breaches of the terms of use of these services by malicious users, which also
requires analysing message content against those terms.
The goal of this thesis is to support the timely identification of messages on social
media channels that may indicate unfair actions or breaches of terms, through an initial
collective analysis of messages. Specifically, we focus on data collection and analysis in two of the
most popular services in the field of Text Message Apps, namely WhatsApp and Telegram. We
approach the problem in distinct phases. First, we create two unique datasets for each application,
derived from conversations among members of different groups within each application. Second,
we analyse this data using the K-means algorithm as well as LDA topic modeling, aiming
to identify keywords that will help us decipher key discussion topics within the groups. Next, for
each cluster or topic, respectively, we depict the most popular terms using word
clouds. Finally, for each message on a channel, we calculate the probability that the message belongs to a specific cluster or topic, and for each channel we return the distribution of
messages over the respective clusters or topics.
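
The phases described above can be illustrated with a minimal sketch. It is not the thesis implementation: the toy messages, the channel names (`group_a`, `group_b`) and all function names are hypothetical, and the code uses only the Python standard library where a real pipeline would more likely rely on a library such as scikit-learn. It computes TF-IDF vectors, runs a small K-means, lists the top terms per cluster (the kind of input a word cloud would take) and returns the per-channel distribution of messages over clusters. K-means gives hard assignments here; a topic model such as LDA would instead give a probability distribution over topics for each message.

```python
import math
import random
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenised docs."""
    vocab = sorted({t for d in docs for t in d})
    df = Counter(t for d in docs for t in set(d))  # document frequency
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}  # smoothed IDF
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors

def kmeans(vectors, k, iters=20, seed=0):
    """Plain K-means with squared Euclidean distance and random initial centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def channel_distribution(channels, labels, k):
    """Fraction of each channel's messages assigned to each cluster."""
    counts = {}
    for ch, l in zip(channels, labels):
        counts.setdefault(ch, [0] * k)[l] += 1
    return {ch: [c / sum(cs) for c in cs] for ch, cs in counts.items()}

# Hypothetical toy data: (channel, message text).
messages = [
    ("group_a", "cheap tickets flight deal"),
    ("group_a", "flight deal tickets today"),
    ("group_b", "football match goal tonight"),
    ("group_b", "goal football match score"),
    ("group_b", "cheap flight tickets"),
]
channels = [ch for ch, _ in messages]
docs = [text.split() for _, text in messages]

k = 2
vocab, vectors = tfidf(docs)
labels, centroids = kmeans(vectors, k)
dist = channel_distribution(channels, labels, k)

# Top-weighted terms per cluster centroid.
for c, centroid in enumerate(centroids):
    top = sorted(zip(centroid, vocab), reverse=True)[:3]
    print(c, [term for _, term in top])
print(dist)
```

The cluster a message lands in depends on the random initialisation, so the seed is fixed for repeatability; with real data one would also remove stop words and choose `k` by inspection or a quality measure.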