This article reports on a study of the interrater reliability of constructed response items in standardized tests of reading. Two panels of raters (lower secondary teachers and test developers) rated student responses to 11 items taken from the Norwegian national reading test for eighth grade. Consensus estimates and measurement estimates were combined with a qualitative analysis of difficult-to-score student responses. Based on findings about rater agreement, the distribution of rater severity, and troublesome response characteristics, the article provides knowledge about both actual and attainable levels of interrater reliability and discusses the use and development of open-ended reading test items.

Keywords: Assessment, constructed response, interrater reliability, national tests, reading