Teaching English Language

Rater Training Through Eye-Tracking: A Case-Study of a Novice Rater

Document Type: Original Article

Authors
1 Department of Foreign Languages and Linguistics, School of Literature and Humanities, Shiraz University
2 Faculty of Foreign Languages, Shiraz University
DOI: 10.22132/tel.2025.472698.1668
Abstract
This study examined rater training supported by eye-tracking. A novice rater participated in a rater training program informed by tracking his eye movements. Immediately after rating a sample essay in each session, the rater received eye-tracking feedback in the form of a heat map generated from his eye movements. The heat map was discussed with the rater to help him understand his rating behavior and to pinpoint which rubric descriptors and essay parts he attended to most while rating. The findings revealed that in the early sessions the rater was influenced by the primacy effect; that is, he focused mostly on the first two criteria (content and organization). Initially, he also struggled to decide on a band score and devoted considerable attention to the scores rather than to the descriptors. After several training sessions, however, the rater appeared to modify his behavior and attempted to attend to all criteria and their corresponding descriptors. The findings can help rater trainers organize rating programs more effectively by employing eye-tracking systems to scrutinize raters' behavior.
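The heat-map feedback described above can be illustrated with a minimal, hypothetical sketch (not the authors' pipeline): assuming fixation data are available as (x, y, duration) tuples in screen-pixel coordinates, fixation durations can be accumulated on a pixel grid and Gaussian-smoothed so that regions the rater looked at longer appear "hotter." All names, values, and the smoothing parameter below are illustrative assumptions.

```python
# Minimal sketch of building a gaze heat map from hypothetical fixation data.
# Assumes fixations are (x_pixel, y_pixel, duration_ms) tuples; not the study's actual pipeline.
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter

SCREEN_W, SCREEN_H = 1920, 1080  # assumed display resolution

# Hypothetical fixations recorded while a rater reads an essay and rubric.
fixations = [(400, 300, 220), (420, 310, 450), (960, 540, 180), (1500, 200, 300)]

# Accumulate fixation durations on a 2-D grid, then smooth with a Gaussian
# kernel so longer and denser fixations show up as hotter regions.
grid = np.zeros((SCREEN_H, SCREEN_W))
for x, y, dur in fixations:
    grid[int(y), int(x)] += dur
heatmap = gaussian_filter(grid, sigma=40)  # sigma chosen for illustration only

plt.imshow(heatmap, cmap="hot")
plt.axis("off")
plt.title("Gaze heat map (illustrative)")
plt.show()
```

In a training session such an image could then be overlaid on the essay and rubric to show the rater which descriptors and text regions drew most of his attention.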
Keywords

Volume 19, Issue 2
July 2025
Pages 511-539

  • Receive Date: 10 August 2024
  • Revise Date: 19 August 2025
  • Accept Date: 22 August 2025