class: center, middle, inverse, title-slide

# Topic Modeling Workshop

## Slides –
.white[storopoli.io/topic-modeling-workshop]
### Jose Storopoli, PhD ###
###
### 04/05/2021 --- class: animated, fadeIn layout: true ---
# What is topic modeling?

<img src="images/topic-modeling.jpg" width="100%" />

.footnote[
Jordan, M. I., & Mitchell, T. M. (2015). Machine learning: Trends, perspectives, and prospects. Science, 349(6245), 255–260. https://doi.org/10.1126/science.aaa8415
]

???
Probabilistic model

---
class: inverse, middle, center

# Iramuteq vs Topic Modeling

<img src="images/fight.jpg" width="200" />

---

.footnote[
Reinert, M. (1990). Alceste une méthodologie d'analyse des données textuelles et une application: Aurelia De Gerard De Nerval. Bulletin of Sociological Methodology/Bulletin de méthodologie sociologique, 26(1), 24–54.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
]

.pull-left[
## Iramuteq

* Hierarchical clustering
* Point estimate
* Error
* Many researcher "degrees of freedom"
* Reinert (1990) - 787 citations
]

--

.pull-right[
## Topic Modeling

* Probabilistic model
* Posterior density
* Uncertainty
* Only one researcher "degree of freedom"
* Blei et al. (2003) - 37,287 citations
* Published in Nature, PLoS, PNAS, etc.
* Used by Amazon
]

???
Iramuteq is clustering -> a point estimate and a single cluster membership.
Topic modeling is a generative probabilistic (Bayesian) model -> a full posterior density and a vector of membership probabilities `\(\mathbf{p} = (p_1, \dots, p_k)\)` with `\(\sum_{i=1}^{k} p_i = 1\)`.

---
class: inverse, middle, center

# But we still have to process text

<img src="images/corpus.jpg" width="75%" />

---

## Text Preprocessing
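As Denny & Spirling (2018) show, every preprocessing step is a researcher choice that can change downstream results. A minimal illustrative sketch of the usual steps with `{quanteda}` (the toy texts and the particular choices below are assumptions, not prescriptions):

```r
# Illustrative preprocessing pipeline with {quanteda}.
library(quanteda)

txt <- c(d1 = "The Ring must be destroyed!",
         d2 = "Destroy the Ring, Frodo.")

toks <- tokens(txt,
               remove_punct = TRUE,    # drop punctuation
               remove_numbers = TRUE)  # drop digits
dfmat <- dfm(toks, tolower = TRUE)              # lowercase
dfmat <- dfm_remove(dfmat, stopwords("en"))     # remove stopwords ("the", "must", ...)
dfmat <- dfm_wordstem(dfmat, language = "en")   # stem: "destroyed" and "destroy" merge
print(dfmat)
```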
.footnote[
Denny, M. J., & Spirling, A. (2018). Text Preprocessing For Unsupervised Learning: Why It Matters, When It Misleads, And What To Do About It. Political Analysis, 26(2), 168–189. https://doi.org/10.1017/pan.2017.44

Storopoli, J. E. (2019). Topic Modeling: How and why to use in management research. Iberoamerican Journal of Strategic Management (IJSM), 18(3), 8–20.
]

---

# Structural Topic Modeling (STM)

.small[
* Topic modeling on steroids
* Uses document metadata to draw inferences about the prevalence and content of each topic
* Goes beyond discovering topics
* Analyzes how document-level information relates to the topics
* Farrell (2016) analyzed more than 40,000 documents on climate change from 120 organizations
* Kuhn (2018) analyzed more than 25,000 aviation incident reports
]

.footnote[
Roberts, M. E., Stewart, B. M., Tingley, D., Lucas, C., Leder-Luis, J., Gadarian, S. K., Albertson, B., & Rand, D. G. (2014). Structural Topic Models for Open-Ended Survey Responses. American Journal of Political Science, 58(4), 1064–1082. https://doi.org/10.1111/ajps.12103

Farrell, J. (2016). Corporate funding and ideological polarization about climate change. Proceedings of the National Academy of Sciences, 113(1), 92–97. https://doi.org/10.1073/PNAS.1509433112

Kuhn, K. D. (2018). Using structural topic modeling to identify latent topics and trends in aviation incident reports. Transportation Research Part C: Emerging Technologies, 87, 105–122.
 
]

---
class: inverse, middle, center

# Tools

<svg viewBox="0 0 512 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:6em;" xmlns="http://www.w3.org/2000/svg"> <path d="M501.1 395.7L384 278.6c-23.1-23.1-57.6-27.6-85.4-13.9L192 158.1V96L64 0 0 64l96 128h62.1l106.6 106.6c-13.6 27.8-9.2 62.3 13.9 85.4l117.1 117.1c14.6 14.6 38.2 14.6 52.7 0l52.7-52.7c14.5-14.6 14.5-38.2 0-52.7zM331.7 225c28.3 0 54.9 11 74.9 31l19.4 19.4c15.8-6.9 30.8-16.5 43.8-29.5 37.1-37.1 49.7-89.3 37.9-136.7-2.2-9-13.5-12.1-20.1-5.5l-74.4 74.4-67.9-11.3L334 98.9l74.4-74.4c6.6-6.6 3.4-17.9-5.7-20.2-47.4-11.7-99.6.9-136.6 37.9-28.5 28.5-41.9 66.1-41.2 103.6l82.1 82.1c8.1-1.9 16.5-2.9 24.7-2.9zm-103.9 82l-56.7-56.7L18.7 402.8c-25 25-25 65.5 0 90.5s65.5 25 90.5 0l123.6-123.6c-7.6-19.9-9.9-41.6-5-62.7zM64 472c-13.2 0-24-10.8-24-24 0-13.3 10.7-24 24-24s24 10.7 24 24c0 13.2-10.7 24-24 24z"></path></svg>

---
class: middle

# R
### [`{stm}`](https://www.structuraltopicmodel.com/) and [`{quanteda}`](http://quanteda.io/)

# Python
### [`gensim`](https://radimrehurek.com/gensim/) and [`scikit-learn`](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction)

# Julia
### [`TextAnalysis.jl`](https://juliahub.com/docs/TextAnalysis)

---
class: inverse, middle

<img src="images/case-study.jpg" width="100%" />

---

# The Lord of the Rings - [Kaggle](https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data?select=lotr_scripts.csv)

.pull-left[
### `lotr_scripts.csv`

* 2,389 lines of dialogue
* `char`: character
* `dialog`: line of dialogue
* `movie`: movie
]

.pull-right[
### `lotr_characters.csv`

* 911 characters
* `race`: Elf, Orc, Human, etc.
* `gender`: Male, Female
]

.footnote[
Kaggle - https://www.kaggle.com/paultimothymooney/lord-of-the-rings-data
]

---

# Data preparation

.pull-left[
```r
library(readtext)
df <- readtext(
  "data/lotr_scripts.csv",
*  text_field = "dialog")
```
]

.pull-right[
* FRODO
* SAM
* GANDALF
* ARAGORN
* PIPPIN
* MERRY
* GOLLUM
* GIMLI
* LEGOLAS
]

---

# Corpus and Tokens

```r
library(quanteda)
*corpus <- corpus(df)
summary(corpus)
```

```
## Corpus consisting of 2390 documents, showing 3 documents:
##
##                Text Types Tokens Sentences id  char                  movie
##  lotr_scripts.csv.1     9     15         1  0 Other The Return of the King
##  lotr_scripts.csv.2     9     17         2  1 Other The Return of the King
##  lotr_scripts.csv.3     2      2         1  2 Other The Return of the King
```

```r
toks <- tokens(corpus,
*  remove_punct = TRUE,
*  remove_symbols = TRUE,
*  remove_numbers = TRUE,
*  remove_separators = TRUE,
*  split_hyphens = TRUE
)
```

---

# Document-Term Matrix (`dtm`)

```r
library(stopwords)
dfm_mat <- dfm(toks,
*  tolower = TRUE)
dfm_mat <- dfm_remove(dfm_mat,
*  pattern = stopwords(language = "en", source = "snowball"))
dfm_mat <- dfm_wordstem(dfm_mat,
*  language = "en")
dfm_mat
```

```
## Document-feature matrix of: 2,390 documents, 8 features (99.45% sparse) and 3 docvars.
##                     features
## docs                 smeagol fish pull arrghh deagol love birthday precious
##   lotr_scripts.csv.1       3    1    0      0      0    0        0        0
##   lotr_scripts.csv.2       0    0    2      0      0    0        0        0
##   lotr_scripts.csv.3       0    0    0      1      0    0        0        0
##   lotr_scripts.csv.4       0    0    0      0      1    0        0        0
##   lotr_scripts.csv.5       0    0    0      0      1    0        0        0
##   lotr_scripts.csv.6       0    0    0      0      1    0        0        0
## [ reached max_ndoc ... 2,384 more documents ]
```

---

# Lines x Character x Movie

<img src="index_files/figure-html/ggplot2-1.png" width="504" style="display: block; margin: auto;" />

---

# How many topics?
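A common way to compare candidate numbers of topics is `stm::searchK()`, which fits a model for each `K` and reports held-out likelihood, residuals, semantic coherence, and the lower bound. A sketch, assuming the `dfm_mat` object built on the preceding slides; the grid of `K` values is illustrative:

```r
# Sketch: compare candidate topic counts with stm::searchK().
# Assumes `dfm_mat` (the {quanteda} dfm from the previous slides).
library(stm)
library(quanteda)

dtm_stm <- convert(dfm_mat, to = "stm")   # {quanteda} dfm -> {stm} format

k_search <- searchK(dtm_stm$documents,
                    vocab = dtm_stm$vocab,
                    K = c(3, 5, 10, 15),  # candidate topic counts (illustrative)
                    data = dtm_stm$meta,
                    heldout.seed = 123)   # reproducible held-out split

# Diagnostics (held-out likelihood, residuals, semantic coherence,
# lower bound) plotted against K:
plot(k_search)
```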
.small[
First we need to convert the `{quanteda}` `dtm` into the `{stm}` format:
]

```r
dtm_stm <- convert(dfm_mat, to = "stm")
```

<img src="index_files/figure-html/how_many_k-1.png" width="50%" style="display: block; margin: auto;" />

---

# Topic Modeling

```r
topic_model <- stm(dtm_stm$documents,
  vocab = dtm_stm$vocab,
  data = dtm_stm$meta,
*  K = 3,
*  prevalence =~ movie + char,
*  seed = 123)
```

.large[
```
## Topic 1 Top Words:
##  Highest Prob: king, lord, gondor, smeagol, rohan, citi
##  FREX: king, lord, gondor, smeagol, rohan, citi
##  Lift: 3'7, 3'8, aaa, aaaaagh, ab, account
##  Score: king, lord, gondor, theoden, smeagol, heh
## Topic 2 Top Words:
##  Highest Prob: gandalf, sam, day, death, friend, war
##  FREX: gandalf, sam, death, friend, wait, heart
##  Lift: aaaaaaaaahhhhhhhhhh, aaaaah, aaaahh, aah, abdollen, accord
##  Score: sam, gandalf, hmm, friend, grond, death
## Topic 3 Top Words:
##  Highest Prob: frodo, master, hobbit, time, dead, merri
##  FREX: frodo, hobbit, time, dead, merri, precious
##  Lift: beacon, bilbo, brego, care, carri, eat
##  Score: frodo, precious, merri, dead, kill, pippin
```
]

---

<img src="index_files/figure-html/plot-tm-1.png" width="504" style="display: block; margin: auto;" />

---

# STM

.large[
```r
regression <- estimateEffect(
*  1:3 ~ movie + char,
  topic_model,
  meta = dtm_stm$meta)
summary(regression, topics = 1)
```
]

---

**Topic 1**: king, lord, gondor, smeagol, rohan, citi, saruman

```
## 
## Call:
## estimateEffect(formula = 1:3 ~ movie + char, stmobj = topic_model,
##     metadata = dtm_stm$meta, uncertainty = "Global")
## 
## 
## Topic 1:
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.3045     0.0287   10.59  < 2e-16 ***
## movieThe Return of the King   0.1119     0.0184    6.10  1.3e-09 ***
## movieThe Two Towers           0.1042     0.0172    6.06  1.6e-09 ***
## charFRODO                    -0.1365     0.0334   -4.09  4.5e-05 ***
## charGANDALF                   0.0056     0.0318    0.18  0.86026    
## charGIMLI                    -0.1150     0.0378   -3.04  0.00240 ** 
## charGOLLUM                   -0.1723     0.0351   -4.91  9.6e-07 ***
## charLEGOLAS                  -0.0327     0.0482   -0.68  0.49684    
## charMERRY                    -0.0885     0.0385   -2.30  0.02144 *  
## charOther                    -0.0235     0.0252   -0.93  0.35045    
## charPIPPIN                   -0.1150     0.0335   -3.43  0.00061 ***
## charSAM                      -0.2510     0.0308   -8.15  6.0e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

**Topic 2**: gandalf, sam, day, death, friend, war, tree

```
## 
## Call:
## estimateEffect(formula = 1:3 ~ movie + char, stmobj = topic_model,
##     metadata = dtm_stm$meta, uncertainty = "Global")
## 
## 
## Topic 2:
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.31629    0.02627   12.04  < 2e-16 ***
## movieThe Return of the King -0.04632    0.01838   -2.52    0.012 *  
## movieThe Two Towers          0.01102    0.01728    0.64    0.524    
## charFRODO                    0.13651    0.03166    4.31  1.7e-05 ***
## charGANDALF                 -0.05346    0.03258   -1.64    0.101    
## charGIMLI                    0.02569    0.03912    0.66    0.511    
## charGOLLUM                  -0.14073    0.03409   -4.13  3.8e-05 ***
## charLEGOLAS                 -0.02542    0.04409   -0.58    0.564    
## charMERRY                   -0.00247    0.03799   -0.06    0.948    
## charOther                    0.03094    0.02389    1.30    0.195    
## charPIPPIN                   0.01262    0.03397    0.37    0.710    
## charSAM                      0.02116    0.03067    0.69    0.490    
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---

**Topic 3**: frodo, master, hobbit, time, dead, merri, aragorn

```
## 
## Call:
## estimateEffect(formula = 1:3 ~ movie + char, stmobj = topic_model,
##     metadata = dtm_stm$meta, uncertainty = "Global")
## 
## 
## Topic 3:
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.379540   0.026614   14.26  < 2e-16 ***
## movieThe Return of the King -0.065841   0.017155   -3.84  0.00013 ***
## movieThe Two Towers         -0.115106   0.018120   -6.35  2.6e-10 ***
## charFRODO                   -0.000351   0.031306   -0.01  0.99105    
## charGANDALF                  0.048076   0.032796    1.47  0.14283    
## charGIMLI                    0.089524   0.038393    2.33  0.01981 *  
## charGOLLUM                   0.312678   0.034811    8.98  < 2e-16 ***
## charLEGOLAS                  0.057565   0.048128    1.20  0.23180    
## charMERRY                    0.091165   0.035883    2.54  0.01114 *  
## charOther                   -0.007891   0.024354   -0.32  0.74595    
## charPIPPIN                   0.102262   0.033542    3.05  0.00233 ** 
## charSAM                      0.229294   0.030934    7.41  1.8e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
class: inverse, middle

<img src="images/meme-final.jpg" width="100%" />

---

# Credits!

Slides created with the R package [`xaringan`](https://github.com/yihui/xaringan).

Slide source code available on GitHub at [storopoli/topic-modeling-workshop](https://github.com/storopoli/topic-modeling-workshop).
.pull-left[ <img src="images/Profile Pic.png" width="70%" style="display: block; margin: auto auto auto 0;" /> [![CC BY-SA 4.0][cc-by-sa-image]][cc-by-sa] ] .pull-right[ [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M336.5 160C322 70.7 287.8 8 248 8s-74 62.7-88.5 152h177zM152 256c0 22.2 1.2 43.5 3.3 64h185.3c2.1-20.5 3.3-41.8 3.3-64s-1.2-43.5-3.3-64H155.3c-2.1 20.5-3.3 41.8-3.3 64zm324.7-96c-28.6-67.9-86.5-120.4-158-141.6 24.4 33.8 41.2 84.7 50 141.6h108zM177.2 18.4C105.8 39.6 47.8 92.1 19.3 160h108c8.7-56.9 25.5-107.8 49.9-141.6zM487.4 192H372.7c2.1 21 3.3 42.5 3.3 64s-1.2 43-3.3 64h114.6c5.5-20.5 8.6-41.8 8.6-64s-3.1-43.5-8.5-64zM120 256c0-21.5 1.2-43 3.3-64H8.6C3.2 212.5 0 233.8 0 256s3.2 43.5 8.6 64h114.6c-2-21-3.2-42.5-3.2-64zm39.5 96c14.5 89.3 48.7 152 88.5 152s74-62.7 88.5-152h-177zm159.3 141.6c71.4-21.2 129.4-73.7 158-141.6h-108c-8.8 56.9-25.6 107.8-50 141.6zM19.3 352c28.6 67.9 86.5 120.4 158 141.6-24.4-33.8-41.2-84.7-50-141.6h-108z"></path></svg> storopoli.io](https://storopoli.io) [<svg viewBox="0 0 448 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M416 32H31.9C14.3 32 0 46.5 0 64.3v383.4C0 465.5 14.3 480 31.9 480H416c17.6 0 32-14.5 32-32.3V64.3c0-17.8-14.4-32.3-32-32.3zM135.4 416H69V202.2h66.5V416zm-33.2-243c-21.3 0-38.5-17.3-38.5-38.5S80.9 96 102.2 96c21.2 0 38.5 17.3 38.5 38.5 0 21.3-17.2 38.5-38.5 38.5zm282.1 243h-66.4V312c0-24.8-.5-56.7-34.5-56.7-34.6 0-39.9 27-39.9 54.9V416h-66.4V202.2h63.7v29.2h.9c8.9-16.8 30.6-34.5 62.9-34.5 67.2 0 79.7 44.3 79.7 101.9V416z"></path></svg> @storopoli](https://www.linkedin.com/in/storopoli/) [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 
0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> @JoseStoropoli](https://www.twitter.com/JoseStoropoli) [<svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 
4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> @storopoli](http://github.com/storopoli) [<svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;" xmlns="http://www.w3.org/2000/svg"> <path d="M440 6.5L24 246.4c-34.4 19.9-31.1 70.8 5.7 85.9L144 379.6V464c0 46.4 59.2 65.5 86.6 28.6l43.8-59.1 111.9 46.2c5.9 2.4 12.1 3.6 18.3 3.6 8.2 0 16.3-2.1 23.6-6.2 12.8-7.2 21.6-20 23.9-34.5l59.4-387.2c6.1-40.1-36.9-68.8-71.5-48.9zM192 464v-64.6l36.6 15.1L192 464zm212.6-28.7l-153.8-63.5L391 169.5c10.7-15.5-9.5-33.5-23.7-21.2L155.8 332.6 48 288 464 48l-59.4 387.3z"></path></svg> josees@uni9.pro.br](mailto:josees@uni9.pro.br) ] [cc-by-sa]: http://creativecommons.org/licenses/by-sa/4.0/ [cc-by-sa-image]: https://licensebuttons.net/l/by-sa/4.0/88x31.png [cc-by-sa-shield]: https://img.shields.io/badge/License-CC%20BY--SA%204.0-lightgrey.svg