Teaching English Language

Rater Training Through Eye-Tracking: A Case-Study of a Novice Rater

Document Type: Original Article

Authors
1 Department of Foreign Languages and Linguistics, School of Literature and Humanities, Shiraz University
2 Faculty of Foreign Languages, Shiraz University
DOI: 10.22132/tel.2025.472698.1668
Abstract
This study examined rater training supported by eye-tracking. A novice rater participated in a rater training program informed by tracking his eye movements. Immediately after rating a sample essay in each session, the rater received eye-tracking feedback in the form of a heat map generated from his eye movements. The heat map was discussed with the rater to help him understand his rating behavior and to pinpoint which rubric descriptors and essay parts he attended to most while rating. The findings revealed that in the early sessions the rater was influenced by the primacy effect; that is, he focused mostly on the first two criteria (content and organization). Initially, he also struggled to decide on a band score and devoted considerable attention to the scores rather than to the descriptors. After several training sessions, however, the rater appeared to modify his behavior and attempted to attend to all criteria and their corresponding descriptors. The findings can help rater trainers organize rating programs more effectively by employing eye-tracking systems to scrutinize raters' behavior.
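The heat-map feedback described above can be illustrated with a minimal, hypothetical sketch (not the authors' pipeline): assuming fixation data are available as (x, y, duration) tuples in screen-pixel coordinates, fixation durations can be accumulated on a pixel grid and Gaussian-smoothed so that regions the rater looked at longer appear "hotter." All names, values, and the smoothing parameter below are illustrative assumptions.

```python
# Minimal sketch of building a gaze heat map from hypothetical fixation data.
# Assumes fixations are (x_pixel, y_pixel, duration_ms) tuples; not the study's actual pipeline.
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

SCREEN_W, SCREEN_H = 1920, 1080  # assumed display resolution

# Hypothetical fixations recorded while a rater reads an essay and rubric.
fixations = [(400, 300, 220), (420, 310, 450), (960, 540, 180), (1500, 200, 300)]

# Accumulate fixation durations on a 2-D grid, then smooth with a Gaussian
# kernel so longer and denser fixations show up as hotter regions.
grid = np.zeros((SCREEN_H, SCREEN_W))
for x, y, dur in fixations:
    grid[int(y), int(x)] += dur
heatmap = gaussian_filter(grid, sigma=40)  # sigma chosen for illustration only

plt.imshow(heatmap, cmap="hot")
plt.axis("off")
plt.title("Gaze heat map (illustrative)")
plt.show()
```

In a training session such an image could then be overlaid on the essay and rubric to show the rater which descriptors and text regions drew most of his attention.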
Keywords

Volume 19, Issue 2
July 2025
Pages 511-539

  • Receive Date: 10 August 2024
  • Revise Date: 19 August 2025
  • Accept Date: 22 August 2025