Resources

Use existing data sets

Methods Specific Papers

  1. Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. A., & Danforth, C. M. (2011). Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS ONE, 6(12), e26752.

Applications

  1. Sandhaus, E. (2008). The New York Times Annotated Corpus [data file]. Available here.

Gather new data

Methods Specific Papers

  1. Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. A., & Danforth, C. M. (2011). Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS ONE, 6(12), e26752.

Applications

  1. Frimer, J. A., Aquino, K., Gebauer, J. E., Zhu, L. L., & Oakes, H. (2015). A decline in prosocial language helps explain public disapproval of the US Congress. Proceedings of the National Academy of Sciences, 112(21), 6591-6594. (social science)
  2. Dehghani et al., 2015
  3. Mohammad, S. M., & Turney, P. D. (2010). Emotions evoked by common words and phrases: Using Mechanical Turk to create an emotion lexicon. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (pp. 26-34). Stroudsburg, PA: Association for Computational Linguistics. (social science)

Preprocessing

Methods Specific Papers

  1. Boyd, R. L. (2012). ConverSplitter (Version 0.8.0 beta) [software]. Available here.
  2. Boyd, R. L. (2015). MEH: Meaning Extraction Helper (Version 1.3.01) [software]. Available here.
  3. Boyd, R. L. (2015). RIOT Scan (Version 2.0.11) [software]. Available here.
  4. IBM Corporation (2011). IBM SPSS Modeler 14.2 Algorithms Guide [software manual]. Available here.
  5. Gagolewski, M., & Tartanus, B. (2015). Stringi: The string processing package for R [software]. Available here.
  6. Vijayarani, S., Ilamathi, J., & Nithya, (2015). Preprocessing techniques for text mining – An overview. International Journal of Computer Science & Communication Networks, 5(1), 7-16.
  7. Porter, M. F. (2006). An algorithm for suffix stripping. Program: Electronic Library and Information Systems, 40(3), 211–218.
  8. Feinerer, I., Meyer, D., & Hornik, K. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1–54.
  9. Abell, M. (2014). SAS Text Miner [Software]. Available here.

Applications

  1. Han, B., & Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (pp. 368-378). Stroudsburg, PA: Association for Computational Linguistics.
  2. Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.
  3. Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.

Measure known topic/category

Methods Specific Papers

  1. Pennebaker, J. W., Booth, R. J., & Francis, M. E. (2007). Linguistic Inquiry and Word Count: LIWC2007 [software manual]. Available here.
  2. Péladeau, N. (2003). WordStat: Content analysis module for SIMSTAT. Montreal, Canada: Provalis Research.
  3. Akthar, F., & Hahne, C. (2012). RapidMiner 5 operator reference. Rapid-I GmbH.
  4. Crossley, S. A., Allen, L. K., Kyle, K., & McNamara, D. S. (2014). Analyzing discourse processing using a simple natural language processing tool. Discourse Processes, 51(5-6), 511-534.

Applications

  1. Boyd, R. L., & Pennebaker, J. W. (2015). Did Shakespeare write Double Falsehood? Identifying individuals by creating psychological signatures with text analysis. Psychological Science, 26(5), 570–582.
  2. Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54.
  3. Graham, J., Haidt, J., & Nosek, B. A. (2009). Liberals and conservatives rely on different sets of moral foundations. Journal of Personality and Social Psychology, 96(5), 1029-1045.
  4. Hancock, J. T., Landrigan, C., & Silver, C. (2007). Expressing emotion in text-based communication. In Proceedings of the CHI’07 conference on human factors in computing systems (pp. 929-932). New York, NY: Association for Computing Machinery Press.

Identify new topic or category

Methods Specific Papers

  1. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
  2. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
  3. Tan, S., Bu, J., Qin, X., Chen, C., & Cai, D. (2014). Cross domain recommendation based on multi-type media fusion. Neurocomputing, 127, 124-134.
  4. Boyd, R. L. (2015). RIOT Scan (Version 2.0.11) [software]. Available here.

Applications

  1. Chen, Z., Koh, P. W., Ritter, P. L., Lorig, K., Bantum, E. O. C., & Saria, S. (2015). Dissecting an online intervention for cancer survivors: Four exploratory analyses of internet engagement and its effects on health status and health behaviors. Health Education & Behavior, 42(1), 32-45.
  2. Ramage, D., Dumais, S. T., & Liebling, D. J. (2010). Characterizing microblogs with topic models. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media (pp. 130-137). Menlo Park, CA: AAAI Press.
  3. Stark, A., Shafran, I., & Kaye, J. (2011). Supervised and unsupervised feature selection for inferring social nature of telephone conversations from their content. In Automatic Speech Recognition and Understanding (pp. 449-454). New York, NY: IEEE.
  4. Schwartz, H. A., Eichstaedt, J., Kern, M. L., Park, G., Sap, M., Stillwell, D., Kosinski, M., & Ungar, L. (2014). Towards assessing changes in degree of depression through Facebook. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (pp. 118-125). Stroudsburg, PA: Association for Computational Linguistics.
  5. Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America, 101(Suppl. 1), 5228–5235.
  6. McAuliffe, J. D., & Blei, D. M. (2008). Supervised topic models. In J. C. Platt, D. Koller, Y. Singer, & S. T. Roweis (Eds.), Advances in Neural Information Processing Systems (pp. 121–128). Cambridge, MA: MIT Press.
  7. Resnik, P., Armstrong, W., Claudino, L., Nguyen, T., Nguyen, V.-A., & Boyd-Graber, J. (2015). Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (pp. 54-60). Denver, CO: Association for Computational Linguistics.
  8. Boyd, R. L., & Pennebaker, J. W. (2015). Did Shakespeare write Double Falsehood? Identifying individuals by creating psychological signatures with text analysis. Psychological Science, 26(5), 570–582.
  9. Ramirez-Esparza, N., Chung, C. K., Kacewicz, E., & Pennebaker, J. W. (2008). The psychology of word use in depression forums in English and in Spanish: Testing two text analytic approaches. In International Conference on Weblogs and Social Media (pp. 102-108). Menlo Park, CA: AAAI Press.
  10. Chung, C. K., & Pennebaker, J. W. (2008). Revealing dimensions of thinking in open-ended self-descriptions: An automated meaning extraction method for natural language. Journal of Research in Personality, 42(1), 96-132.

Expand existing topic or category

Methods Specific Papers

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. ICLR Workshop, 1-12.
  2. Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (pp. 1532-1543). Stroudsburg, PA: Association for Computational Linguistics.
  3. Andrzejewski, D., & Zhu, X. (2009). Latent Dirichlet Allocation with topic-in-set knowledge. In Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing (pp. 43–48). Stroudsburg, PA: Association for Computational Linguistics.

Applications

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–407.
  2. Dehghani, M., Gratch, J., Sachdeva, S., & Sagae, K. (2014). Analyzing conservative and liberal blogs related to the construction of the ‘Ground Zero Mosque’. Journal of Information Technology & Politics, 11(1), 1-14.

Comparing known groups

Methods Specific Papers

  1. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. ACM SIGKDD Explorations Newsletter, 11(1), 10-18.
  2. Borgelt, C., & Rodríguez, G. G. (2007, July). FrIDA: A free intelligent data analysis toolbox. In Proceedings IEEE International Conference on Fuzzy Systems (pp. 1892–1896). London: IEEE.
  3. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
  4. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems (pp. 1097-1105). Red Hook, NY: Curran Associates, Inc.
  5. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. ICLR Workshop, 1-12.
  6. Nguyen, L. T., Wu, P., Chan, W., Peng, W., & Zhang, Y. (2012). Predicting collective sentiment dynamics from time-series social media. In Proceedings of the first international workshop on issues of sentiment discovery and opinion mining (pp. 1-8). New York, NY: ACM.
  7. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
  8. Liaw, A., & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18-22.
  9. Filatova, E. (2012). Irony and sarcasm: Corpus generation and analysis using crowdsourcing. In Proceedings of Language Resources and Evaluation Conference (pp. 392–398). Istanbul, Turkey: European Language Resources Association.

Applications

  1. Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In C. Nédellec, & C. Rouveirol (Eds.) Machine learning: ECML-98 (pp. 4-15). Berlin Heidelberg: Springer.
  2. Dehghani, M., Gratch, J., Sachdeva, S., & Sagae, K. (2014). Analyzing conservative and liberal blogs related to the construction of the ‘Ground Zero Mosque’. Journal of Information Technology & Politics, 11(1), 1-14.
  3. Sumner et al., 2012
  4. Nguyen, M. T., & Lim, E. P. (2014). On predicting religion labels in microblogging networks. In Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval (pp. 1211-1214). New York, NY: ACM.
  5. Stark, A., Shafran, I., & Kaye, J. (2011). Supervised and unsupervised feature selection for inferring social nature of telephone conversations from their content. In Automatic Speech Recognition and Understanding (pp. 449-454). New York, NY: IEEE.
  6. Diermeier, D., Godbout, J.-F., Yu, B., & Kaufmann, S. (2011). Language and ideology in Congress. British Journal of Political Science, 42(1), 31–55.

Identify language that reflects latent attitudes/opinions

Methods Specific Papers

  1. Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–407.
  2. Kim, S. M., & Hovy, E. (2006). Extracting opinions, opinion holders, and topics expressed in online news media text. In Proceedings of the Workshop on Sentiment and Subjectivity in Text (pp. 1-8). Stroudsburg, PA: Association for Computational Linguistics.

Applications

  1. Rude, S., Gortner, E.-M., & Pennebaker, J. (2004). Language use of depressed and depression-vulnerable college students. Cognition and Emotion, 18(8), 1121–1133.
  2. Mitchell, L., Frank, M., Harris, K., Dodds, P., & Danforth, C. (2013). The geography of happiness: Connecting Twitter sentiment and expression, demographics, and objective characteristics of place. PLoS ONE, 8(5), e64417. doi:10.1371/journal.pone.0064417

Identify new groups

Methods Specific Papers

  1. Johnson, S. C. (1967). Hierarchical clustering schemes. Psychometrika, 32(3), 241–254.
  2. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. Le Cam & J. Neyman (Eds.), Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Volume 1 (pp. 281-297). Los Angeles, CA: University of California Press.
  3. Lin, D., & Wu, X. (2009). Phrase clustering for discriminative learning. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2 (pp. 1030-1038). Stroudsburg, PA: Association for Computational Linguistics.

Applications

  1. Ritter, R. S., & Preston, J. L. (2013). Representations of religious words: Insights for religious priming research. Journal for the Scientific Study of Religion, 52(3), 494-507.

Multi-method comparisons

Applications

  1. Sumner et al., 2012
  2. Eichstaedt, J. C., Schwartz, H. A., Kern, M. L., Park, G., Labarthe, D. R., Merchant, R. M., & Seligman, M. E. P. (2015). Psychological language on Twitter predicts county-level heart disease mortality. Psychological Science, 26(2), 159–169.
  3. Stark, A., Shafran, I., & Kaye, J. (2011). Supervised and unsupervised feature selection for inferring social nature of telephone conversations from their content. In Automatic Speech Recognition and Understanding (pp. 449-454). New York, NY: IEEE.

Comparisons to other variables

Applications

  1. D'Mello, S., & Graesser, A. (2012). Language and discourse are powerful signals of student emotions during tutoring. IEEE Transactions on Learning Technologies, 5(4), 304-317.
  2. Eichstaedt, J. C., Schwartz, H. A., Kern, M. L., Park, G., Labarthe, D. R., Merchant, R. M., & Seligman, M. E. P. (2015). Psychological language on Twitter predicts county-level heart disease mortality. Psychological Science, 26(2), 159–169.
  3. Tausczik, Y. R., & Pennebaker, J. W. (2010). The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1), 24–54.
  4. Chae, D. H., Clouston, S., Hatzenbuehler, M. L., Kramer, M. R., Cooper, H. L., Wilson, S. M., ... & Link, B. G. (2015). Association between an internet-based measure of area racism and Black mortality. PLoS ONE, 10(4), e0122963.

Follow-up Experiments

  1. Dehghani et al., 2015