Ary, D., Jacobs, L. C., Irvine, C. K. S., & Walker, D. (2018). Introduction to research in education (10th ed.). Cengage Learning.
Ashraf, H., Sodergren, M. H., Merali, N., Mylonas, G., Singh, H., & Darzi, A. (2018). Eye-tracking technology in medical education: A systematic review. Medical Teacher, 40(1), 62-69.
Bachman, L. F. (1990). Fundamental considerations in language testing. Oxford University Press.
Ballard, L. (2017). The effects of primacy on rater cognition: An eye-tracking study. Michigan State University.
Barkaoui, K. (2010). Variability in ESL essay rating processes: The role of the rating scale and rater experience. Language Assessment Quarterly, 7(1), 54-74.
Bejar, I. I., Williamson, D. M., & Mislevy, R. J. (2006). Human scoring. In D. M. Williamson, R. J. Mislevy, & I. I. Bejar (Eds.), Automated scoring of complex tasks in computer-based testing (pp. 49-82). Lawrence Erlbaum Associates.
Chen, K. T., Prouzeau, A., Langmead, J., Whitelock-Jones, R. T., Lawrence, L., Dwyer, T., ... & Goodwin, S. (2023, May). Gazealytics: A Unified and Flexible Visual Toolkit for Exploratory and Comparative Gaze Analysis. In Proceedings of the 2023 Symposium on Eye Tracking Research and Applications (pp. 1-7). Preprint available at arXiv:2303.17202.
Conklin, K., & Pellicer-Sánchez, A. (2016). Using eye-tracking in applied linguistics and second language acquisition research. Second Language Research, 32(3), 453-467.
Cumming, A. (1990). Expertise in evaluating second-language compositions. Language Testing, 7(1), 31-51.
Cumming, A., Kantor, R., & Powers, D. E. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive framework. The Modern Language Journal, 86(1), 67-96.
DeRemer, M. (1998). Writing assessment: Raters’ elaboration of the rating task. Assessing Writing, 5, 7–29.
Deygers, B., & Van Gorp, K. (2015). Determining the scoring validity of a co-constructed CEFR-based rating scale. Language Testing, 32(4), 521-541.
Diederich, P. B., French, J. W., & Carlton, S. T. (1961). Factors in judgments of writing ability. ETS Research Bulletin Series, 1961(2), i-93.
Dogan, C. D., & Uluman, M. (2017). A comparison of rubrics and graded category rating scales with various methods regarding raters' reliability. Educational Sciences: Theory and Practice, 17(2), 631-651.
Eckes, T. (2005). Examining rater effects in TestDaF writing and speaking performance assessments: A many-facet Rasch analysis. Language Assessment Quarterly, 2(3), 197–221.
Eckes, T. (2015). Introduction to many-facet Rasch measurement: Analyzing and evaluating rater-mediated assessments (2nd ed.). Peter Lang.
Eckstein, G., Casper, R., Chan, J., & Blackwell, L. (2018). Assessment of L2 student writing: Does teacher disciplinary background matter? Journal of Writing Research, 10(1), 1-23.
Elder, C., Knoch, U., Barkhuizen, G., & Von Randow, J. (2005). Individual feedback to enhance rater training: Does it work? Language Assessment Quarterly, 2(3), 175-196.
Elder, C., Barkhuizen, G., Knoch, U., & Von Randow, J. (2007). Evaluating rater responses to an online training program for L2 writing assessment. Language Testing, 24(1), 37-64.
Engelhard Jr, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. Routledge.
Erguvan, I. D., & Dünya, B. A. (2021). Gathering evidence on e-rubrics: Perspectives and many facet Rasch analysis of rating behavior. International Journal of Assessment Tools in Education, 8(2), 454-474.
Erlam, R., von Randow, J., & Read, J. (2013). Investigating an online rater training program: Product and process. Papers in Language Testing and Assessment, 2(1), 1-29.
Godfroid, A. (2019). Investigating instructed second language acquisition using L2 learners’ eye-tracking data. In The Routledge handbook of second language research in classroom learning (pp. 44-57). Routledge.
Godfroid, A., & Spino, L. A. (2015). Reconceptualizing reactivity of think‐alouds and eye tracking: Absence of evidence is not evidence of absence. Language Learning, 65(4), 896-928.
Godfroid, A., Winke, P., & Conklin, K. (2020). Exploring the depths of second language processing with eye tracking: An introduction. Second Language Research, 36(3), 243-255.
Gyamfi, G., Hanna, B. E., & Khosravi, H. (2022). The effects of rubrics on evaluative judgement: A randomised controlled experiment. Assessment & Evaluation in Higher Education, 47(1), 126-143.
Hamp-Lyons, L. (2007). Worrying about rating. Assessing Writing, 12(1), 1-9.
Harsch, C., & Martin, G. (2012). Adapting CEF-descriptors for rating purposes: Validation by a combined rater training and scale revision approach. Assessing Writing, 17(4), 228-250.
Jacobs, H., Zinkgraf, S., Wormuth, D., Hartfiel, V., & Hughey, J. (1981). Testing ESL composition: A practical approach. Rowley, MA: Newbury House.
Janssen, G., Meier, V., & Trace, J. (2015). Building a better rubric: Mixed methods rubric revision. Assessing Writing, 26, 51-66.
Jin, K. Y., & Eckes, T. (2022). Detecting differential rater functioning in severity and centrality: The dual DRF facets model. Educational and Psychological Measurement, 82(4), 757-781.
Johnson, J. S., & Lim, G. S. (2009). The influence of rater language background on writing performance assessment. Language Testing, 26(4), 485-505.
King, A. J., Bol, N., Cummins, R. G., & John, K. K. (2019). Improving visual behavior research in communication science: An overview, review, and reporting recommendations for using eye-tracking methods. Communication Methods and Measures, 13(3), 149-177.
Knoch, U. (2009). Diagnostic assessment of writing: A comparison of two rating scales. Language Testing, 26(2), 275-304.
Knoch, U. (2011). Investigating the effectiveness of individualized feedback to rating behavior—a longitudinal study. Language Testing, 28(2), 179-200.
Knoch, U., Read, J., & von Randow, J. (2007). Re-training writing raters online: How does it compare with face-to-face training? Assessing Writing, 12(1), 26-43.
Li, Y., Wei, C., & Ma, T. (2019). Towards explaining the regularization effect of initial large learning rate in training neural networks. Advances in Neural Information Processing Systems, 3(2), 1-49.
Linacre, J. M. (2004). Optimizing rating scale effectiveness. In E. V. Smith & R. M. Smith (Eds.), Introduction to Rasch measurement (pp. 258–278). JAM Press.
Low, A. R. L., & Aryadoust, V. (2021). Investigating test-taking strategies in listening assessment: A comparative study of eye-tracking and self-report questionnaires. International Journal of Listening, 35(1), 1-20.
Lumley, T. (2002). Assessment criteria in a large-scale writing test: What do they really mean to the raters? Language Testing, 19(3), 246-276.
Lumley, T. (2005). Assessing second language writing: The rater’s perspective. Peter Lang.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing, 12(1), 54-71.
Luoma, S. (2004). Assessing speaking. Cambridge University Press.
Myford, C. M., & Wolfe, E. W. (2003). Detecting and measuring rater effects using many-facet Rasch measurement: Part I. Journal of Applied Measurement, 4(4), 386-422.
Myford, C. M., & Wolfe, E. W. (2004). Detecting and measuring rater effects using many-facet Rasch measurement: Part II. Journal of Applied Measurement, 5(2), 189-227.
Rayner, K. (1978). Eye movements in reading and information processing. Psychological Bulletin, 85(3), 618-660.
Rayner, K. (2009). Eye movements in reading: Models and data. Journal of Eye Movement Research, 2(5), 1.
Saito, H. (2008). EFL classroom peer assessment: Training effects on rating and commenting. Language Testing, 25(4), 553-581.
Saslow, J., & Ascher, A. (2015). Top notch (3rd ed.). Pearson Education.
Schaefer, E. (2008). Rater bias patterns in an EFL writing assessment. Language Testing, 25(4), 465-493.
Shin, Y. S. (2009). A FACETS analysis of rater characteristics and rater bias in measuring L2 writing performance. English Language & Literature Teaching, 16(1), 123-142.
Shohamy, E., Gordon, C. M., & Kraemer, R. (1992). The effect of raters’ background and training on the reliability of direct writing tests. The Modern Language Journal, 76(1), 27-33.
Stewart, A. J., Pickering, M. J., & Sturt, P. (2004). Using eye movements during reading as an implicit measure of the acceptability of brand extensions. Applied Cognitive Psychology, 18(6), 697-709.
Suto, I. (2012). A critical review of some qualitative research methods used to explore rater cognition. Educational Measurement: Issues and Practice, 31(3), 21-30.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater's mind? In L. Hamp-Lyons (Ed.), Assessing second language writing in academic contexts (pp. 111-125). Ablex.
Wang, J., & Engelhard Jr, G. (2019). Exploring the impersonal judgments and personal preferences of raters in rater-mediated assessments with unfolding models. Educational and Psychological Measurement, 79(4), 773-795.
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11(2), 197-223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15(2), 263-287.
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative approaches. Assessing Writing, 6(2), 145-178.
Weigle, S. C. (2002). Assessing writing. Cambridge University Press.
Wind, S. A. (2019a). A nonparametric procedure for exploring differences in rating quality across test-taker subgroups in rater-mediated writing assessments. Language Testing, 36(4), 595-616.
Wind, S. A. (2019b). Examining the impacts of rater effects in performance assessments. Applied Psychological Measurement, 43(2), 159-171.
Wind, S. A., & Peterson, M. E. (2018). A systematic review of methods for evaluating rating quality in language assessment. Language Testing, 35(2), 161-192.
Winke, P., & Brunfaut, T. (Eds.). (2021). The Routledge handbook of second language acquisition and language testing. Routledge.
Winke, P., & Lim, H. (2015). ESL essay raters’ cognitive processes in applying the Jacobs et al. rubric: An eye-movement study. Assessing Writing, 25, 38-54.
Wolfe, E. W. (1997). The relationship between essay reading style and scoring proficiency in a psychometric scoring system. Assessing Writing, 4(1), 83-106.
Yan, X. (2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31(4), 501–527.
Youn, S. J. (2018). Rater variability across examinees and rating criteria in paired speaking assessment. Papers in Language Testing and Assessment, 7(1), 32-60.