Topic Modelling

Over the last decade, topic modelling has become increasingly popular among humanities scholars studying textual data. In a nutshell, topic modelling is an algorithmic tool that can identify the key topics of a text; in a collection of texts, it can also retrieve the texts that are the most relevant examples of a given topic. Over the last few weeks we have experimented with topic modelling on the EuroNews corpus. In this post I first offer a very gentle introduction to topic modelling; I then present the results that it has given us.

First of all, we need to clarify what a topic means. When asked to give examples of topics, most people would offer single words such as “Military”, “Politics”, “Space”, etc. For humans, these words meaningfully describe different pieces (or, more technically speaking, domains) of the world. We have tacit background knowledge that helps us recognize the topic “Military” in a text. Computers work differently. For them, a topic is a set of words rather than a single label. Consider the following example:

god, jesus, christ, christian, bible, christians, hell, faith, lord, paul, believe

This group of words was extracted with the help of a topic modelling algorithm from a dataset known as 20 Newsgroups, a collection of approximately 20,000 newsgroup documents. Even though the word “religion” is not mentioned, it is quite clear that these words refer to the topic of religion. How could a topic modelling algorithm identify these words? How does a computer know that, as a group, they form a topic? The short (and highly simplified) answer is that these words regularly co-occur in texts; topic modelling aims to identify words that are somehow connected in a collection of texts and to treat them as topics.
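To make the co-occurrence idea a little more concrete, here is a minimal sketch in Python. It only counts how often two words appear in the same document; real topic modelling algorithms are considerably more sophisticated than this. The toy documents and the function name `cooccurrence_counts` are invented for illustration and are not taken from 20 Newsgroups.

```python
from collections import Counter
from itertools import combinations

# Toy documents, invented for illustration (not real 20 Newsgroups posts).
docs = [
    "god jesus bible faith",
    "jesus christ bible church",
    "rocket orbit launch moon",
    "orbit moon launch shuttle",
    "faith god church christ",
]

def cooccurrence_counts(documents):
    """For every pair of distinct words, count the documents that contain both."""
    counts = Counter()
    for text in documents:
        # Use a sorted set so each pair is counted once per document
        # and always stored in alphabetical order.
        words = sorted(set(text.split()))
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts(docs)
print(counts[("bible", "jesus")])  # both words occur in two documents
print(counts[("bible", "orbit")])  # these words never occur together
```

Words that keep turning up in the same documents (here "bible" and "jesus", or "launch" and "moon") are candidates for belonging to the same topic, while pairs that never meet are not.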

As a next step, practitioners of topic modelling usually ask humans to label each group of words with a meaningful topical category such as religion. Finally, they extract texts that are relevant examples of a given topic. The following text from the 20 Newsgroups dataset illustrates the topic ‘religion’:

From time to time a term like ‘Oneness Pentecostals’ (or something similar) has occurred in posts to this group. I also know that there is a movement called something like ‘Jesus alone.’

I believe in the Trinity and have no plans to change that, but recently I was made aware that there is at least one person within our church who holds the view that there is no trinity. In the near future we will discuss this item, and I feel that I shall ask you, my friends on this group, for background information.
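How might such an example text be retrieved? A very rough sketch, assuming we simply score each document by the share of its words that belong to the topic's word list, could look like this. Topic models normally use the per-document topic proportions they estimate instead, so this is a simplification. The document names and texts below, and the helper `topic_share`, are hypothetical.

```python
# The religion topic words quoted earlier in this post.
topic_words = {"god", "jesus", "christ", "christian", "bible", "christians",
               "hell", "faith", "lord", "paul", "believe"}

# Hypothetical mini-collection, invented for illustration.
docs = {
    "post_a": "I believe in the Trinity and the bible speaks of christ and faith",
    "post_b": "the shuttle launch was delayed because of weather",
    "post_c": "paul wrote letters and christians still read them",
}

def topic_share(text, words):
    """Return the fraction of tokens in `text` that are topic words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in words for token in tokens) / len(tokens)

# Rank documents from most to least relevant to the topic.
ranked = sorted(docs, key=lambda name: topic_share(docs[name], topic_words),
                reverse=True)
print(ranked)
```

The documents at the top of the ranking are the ones a reader would then inspect as candidate examples of the topic.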

On the EuroNews corpus, I applied a topic modelling algorithm called LDA, which stands for Latent Dirichlet Allocation. This uncovered the following list of topics, i.e. groups of topic words:

✨Topic 0✨

[‘giorno’, ‘notte’, ‘galera’, ‘sera’, ‘mattina’, ‘duca’, ‘squadra’, ‘domenica’, ‘porto’, ‘volta’, ‘arrivo’, ‘havendo’, ‘voce’, ‘detto’, ‘venuta’, ‘corriero’, ‘avviso’, ‘ordine’, ‘numero’, ‘andare’]

✨Topic 1✨

[‘parlamento’, ‘lor’, ‘fatto’, ‘regno’, ‘hora’, ‘haveva’, ‘fare’, ‘mandare’, ‘havere’, ‘essere’, ‘havendo’, ‘risposta’, ‘ambasciatore’, ‘volere’, ‘settimana’, ‘passare’, ‘passato’, ‘parte’, ‘trattato’, ‘risoluto’]

✨Topic 2✨

[‘città’, ‘notte’, ‘casa’, ‘fatto’, ‘danno’, ‘tempo’, ‘mano’, ‘fuoco’, ‘giorno’, ‘cavallo’, ‘stato’, ‘piazza’, ‘parte’, ‘ordine’, ‘guardia’, ‘luogo’, ‘governo’, ‘via’, ‘numero’, ‘hore’]

✨Topic 3✨

[‘cardinale’, ‘fatto’, ‘vescovo’, ‘detto’, ‘papa’, ‘duca’, ‘haver’, ‘stato’, ‘anco’, ‘causa’, ‘signore’, ‘chiesa’, ‘hora’, ‘città’, ‘ordine’, ‘habbia’, ‘havuto’, ‘corte’, ‘havendo’, ‘favore’]

✨Topic 4✨

[‘settimana’, ‘contagio’, ‘città’, ‘male’, ‘peste’, ‘regno’, ‘numero’, ‘terra’, ‘provincia’, ‘popolo’, ‘segno’, ‘libertà’, ‘sorte’, ‘salute’, ‘morire’, ‘grano’, ‘maniera’, ‘mandare’, ‘somma’, ‘passato’]

✨Topic 5✨

[‘galee’, ‘passato’, ‘delli’, ‘alli’, ‘armata’, ‘volta’, ‘ordine’, ‘guerra’, ‘regno’, ‘provisioni’, ‘corte’, ‘impresa’, ‘porto’, ‘aviso’, ‘anno’, ‘bene’, ‘oro’, ‘havendo’, ‘corriero’, ‘partito’]

✨Topic 6✨

[‘duca’, ‘stato’, ‘fatto’, ‘conte’, ‘principe’, ‘morte’, ‘fratello’, ‘detto’, ‘generale’, ‘ordine’, ‘anco’, ‘padre’, ‘casa’, ‘vita’, ‘causa’, ‘haveva’, ‘città’, ‘doppo’, ‘regno’, ‘prigione’]

✨Topic 7✨

[‘mattina’, ‘giorno’, ‘chiesa’, ‘città’, ‘sera’, ‘doppo’, ‘festa’, ‘domenica’, ‘solito’, ‘signore’, ‘lunedì’, ‘giovedì’, ‘passato’, ‘fare’, ‘detto’, ‘popolo’, ‘fatto’, ‘principe’, ‘cardinale’, ‘martedì’]

✨Topic 8✨

[‘stato’, ‘generale’, ‘carica’, ‘governatore’, ‘governo’, ‘consiglio’, ‘conte’, ‘signore’, ‘luogo’, ‘corte’, ‘ordine’, ‘regno’, ‘ritorno’, ‘campagna’, ‘provincia’, ‘duca’, ‘carico’, ‘capitano’, ‘passare’, ‘settimana’]

✨Topic 9✨

[‘città’, ‘terra’, ‘fatto’, ‘havendo’, ‘delli’, ‘haver’, ‘hora’, ‘passato’, ‘alli’, ‘anco’, ‘castello’, ‘numero’, ‘detto’, ‘artiglieria’, ‘luogo’, ‘gente’, ‘parte’, ‘cosa’, ‘governatore’, ‘doppo’]

✨Topic 10✨

[‘duca’, ‘signora’, ‘principe’, ‘casa’, ‘signore’, ‘moglie’, ‘matrimonio’, ‘conte’, ‘detto’, ‘ambasciatore’, ‘figliolo’, ‘cardinale’, ‘fratello’, ‘giorno’, ‘domenica’, ‘mattina’, ‘ritorno’, ‘alli’, ‘nome’, ‘corte’]

✨Topic 11✨

[‘pace’, ‘imperatore’, ‘trattato’, ‘fatto’, ‘guerra’, ‘parte’, ‘ambasciatore’, ‘cosa’, ‘fine’, ‘anco’, ‘corte’, ‘stato’, ‘dieta’, ‘intendere’, ‘havendo’, ‘paese’, ‘fare’, ‘ministro’, ‘consiglio’, ‘haver’]

✨Topic 12✨

[‘ambasciatore’, ‘conte’, ‘stato’, ‘corte’, ‘duca’, ‘alli’, ‘partito’, ‘principe’, ‘passato’, ‘ritorno’, ‘viaggio’, ‘anco’, ‘doppo’, ‘imperatore’, ‘gionto’, ‘corriero’, ‘volta’, ‘arrivo’, ‘trattato’, ‘giorno’]

✨Topic 13✨

[‘stato’, ‘venire’, ‘effetto’, ‘mese’, ‘fatto’, ‘campagna’, ‘grano’, ‘tempo’, ‘ordine’, ‘città’, ‘servire’, ‘essere’, ‘danaro’, ‘quantità’, ‘prendere’, ‘habbia’, ‘risoluto’, ‘partito’, ‘anno’, ‘bisogno’]

✨Topic 14✨

[‘porto’, ‘mare’, ‘vasselli’, ‘armata’, ‘guerra’, ‘nave’, ‘avviso’, ‘squadra’, ‘flotta’, ‘terra’, ‘essere’, ‘havendo’, ‘huomini’, ‘alli’, ‘carico’, ‘corrente’, ‘delli’, ‘piazza’, ‘via’, ‘volta’]

✨Topic 15✨

[‘ambasciatore’, ‘cardinale’, ‘mattina’, ‘signore’, ‘papa’, ‘stato’, ‘hieri’, ‘giovedì’, ‘sera’, ‘corriero’, ‘udienza’, ‘audienza’, ‘doppo’, ‘domenica’, ‘chiesa’, ‘giorno’, ‘detto’, ‘solito’, ‘fatto’, ‘casa’]

✨Topic 16✨

[‘dieta’, ‘alli’, ‘imperatore’, ‘regno’, ‘fare’, ‘principio’, ‘mese’, ‘delli’, ‘casa’, ‘contra’, ‘tempo’, ‘dalli’, ‘corte’, ‘luogo’, ‘cosa’, ‘religione’, ‘anno’, ‘haveva’, ‘hora’, ‘hoggi’]

✨Topic 17✨

[‘viaggio’, ‘partenza’, ‘principe’, ‘corte’, ‘volta’, ‘tempo’, ‘ordine’, ‘fare’, ‘partire’, ‘ritorno’, ‘andare’, ‘settimana’, ‘passare’, ‘mese’, ‘alli’, ‘stato’, ‘fine’, ‘salute’, ‘ora’, ‘arrivare’]

✨Topic 18✨

[‘campo’, ‘paese’, ‘anco’, ‘armata’, ‘gente’, ‘numero’, ‘parte’, ‘huomini’, ‘conte’, ‘guerra’, ‘ordine’, ‘città’, ‘quantità’, ‘pace’, ‘generale’, ‘campagna’, ‘artiglieria’, ‘partito’, ‘servitio’, ‘fare’]

✨Topic 19✨

[‘duca’, ‘ambasciatore’, ‘imperatore’, ‘fatto’, ‘corte’, ‘principe’, ‘cardinale’, ‘havendo’, ‘hora’, ‘detto’, ‘titolo’, ‘conte’, ‘signore’, ‘alli’, ‘acciò’, ‘conto’, ‘nome’, ‘haver’, ‘stato’, ‘havuto’]

✨Topic 20✨

[‘parte’, ‘generale’, ‘armata’, ‘gente’, ‘haveva’, ‘fatto’, ‘stato’, ‘campo’, ‘alli’, ‘esercito’, ‘detto’, ‘anco’, ‘soccorso’, ‘fare’, ‘tempo’, ‘passato’, ‘havendo’, ‘duca’, ‘campagna’, ‘principe’]

✨Topic 21✨

[‘anno’, ‘città’, ‘pagare’, ‘alli’, ‘somma’, ‘corrente’, ‘casa’, ‘dare’, ‘denaro’, ‘mese’, ‘ordine’, ‘delli’, ‘trovare’, ‘interesse’, ‘passato’, ‘modo’, ‘paese’, ‘parlamento’, ‘flotta’, ‘regno’]

✨Topic 22✨

[‘parlamento’, ‘regno’, ‘lor’, ‘fare’, ‘essere’, ‘hora’, ‘havere’, ‘bene’, ‘parte’, ‘stato’, ‘religione’, ‘città’, ‘governo’, ‘tempo’, ‘causa’, ‘hoggi’, ‘popolo’, ‘volere’, ‘guerra’, ‘modo’]

✨Topic 23✨

[‘stato’, ‘città’, ‘anco’, ‘duca’, ‘havendo’, ‘fatto’, ‘ordine’, ‘effetto’, ‘delli’, ‘alli’, ‘conto’, ‘essere’, ‘detto’, ‘cosa’, ‘acciò’, ‘habbia’, ‘gente’, ‘andare’, ‘accordo’, ‘fine’]

✨Topic 24✨

[‘male’, ‘doppo’, ‘stato’, ‘notte’, ‘giorno’, ‘vita’, ‘hieri’, ‘mattina’, ‘ora’, ‘tempo’, ‘bene’, ‘sera’, ‘fatto’, ‘parte’, ‘sentire’, ‘mezzo’, ‘havendo’, ‘signore’, ‘haver’, ‘cosa’]
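For readers curious about how word lists like those above can be produced, here is a small, self-contained sketch of LDA fitted with collapsed Gibbs sampling, one common way of estimating the model. It is not the implementation used for the EuroNews corpus: the function names, the hyperparameters `alpha` and `beta`, the number of iterations, and the toy documents are all assumptions made for illustration. The only setting taken from this post is the idea of choosing the number of topics in advance (25 in the lists above, 2 in the toy example).

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenised documents.

    Returns (topic_word_counts, doc_topic_counts, vocab).
    alpha, beta and iterations are illustrative defaults, not tuned values.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word v in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Weight of each topic given all other assignments:
                # how much the document likes the topic times
                # how much the topic likes the word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return topic_word, doc_topic, vocab

def top_words(topic_word, vocab, n=5):
    """List the n most frequent words of each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]

# Toy documents built from words that appear in the topic lists above.
toy_docs = [
    ["porto", "mare", "nave", "flotta", "armata"],
    ["papa", "cardinale", "chiesa", "vescovo"],
    ["nave", "flotta", "mare", "porto"],
    ["cardinale", "papa", "vescovo", "chiesa", "papa"],
]
topic_word, doc_topic, vocab = lda_gibbs(toy_docs, num_topics=2)
topics = top_words(topic_word, vocab, n=4)
for number, words in enumerate(topics):
    print(f"Topic {number}: {words}")
```

Because the sampler is random, the exact word lists depend on the seed and the settings, which is one reason why the resulting topics still need human inspection.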

As you can see, a fundamental problem is assigning a meaningful topic label to each group of words. This is difficult because not every group of words lends itself to a meaningful label. Take topic 17, for instance: it seems meaningful, since it refers to the physical movement of historical actors. Similarly, topic 18 seems to refer to the military. Topic 23, however, does not seem to form a meaningful topic.

All this highlights the key difficulty that topic modelling involves. Sometimes the words identified as parts of a topic are meaningful; sometimes they are not. How can we decide which groups of words form a meaningful topic? What should we do with groups of words that do not seem meaningful? This is the million-dollar question that topic modelling itself cannot resolve. We therefore decided to annotate the corpus manually and let humans decide on the topic of each news item.

Hence, the use of topic modelling for the analysis of content remains quite challenging for us! 

Sources and references:

Alsumait, Loulwah; Barbara, Daniel; Gentle, James; Domeniconi, Carlotta (2009). Topic Significance Ranking of LDA Generative Models, pp. 67-82. doi:10.1007/978-3-642-04180-8_22.

20 Newsgroups: http://qwone.com/~jason/20Newsgroups/

Gabor Toth
