Formulaic expressions are foundational elements of academic discourse, helping writers structure arguments and ensure clarity. Students, especially early in their academic careers, often rely on such phrases to express purpose, signal transitions, or cite prior work. While these expressions support coherence, excessive reliance can reduce rhetorical originality. This study provides an expanded statistical perspective on the frequency of recurrent multiword phrases (MWPs) in student writing, with emphasis on variation across disciplines, distribution across document lengths, and differences between undergraduate and graduate writing.
Corpus and Methodology
The dataset comprises five thousand anonymized student papers ranging from eight hundred to five thousand words, with a median length of approximately two thousand one hundred words. The corpus includes two thousand Humanities papers, two thousand STEM papers, and one thousand Social Sciences papers. A computational extraction method identified multiword expressions of three to six words, and each phrase was counted once per paper to reduce inflation caused by repetition. Phrase density was measured per one thousand words, and summary statistics (variance and standard deviation), correlation analysis, and regression modeling were applied to identify consistent patterns.
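The extraction step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's actual pipeline: the tokenizer, the function names (`ngrams_in_paper`, `paper_counts`, `density_per_1000`), and the `min_papers` threshold are all hypothetical, and the sketch lets phrases span punctuation and sentence boundaries for simplicity.

```python
import re
from collections import Counter


def ngrams_in_paper(text, n_min=3, n_max=6):
    """Return the set of distinct word n-grams (n_min..n_max words) in one paper."""
    # Assumed tokenizer: lowercase runs of letters and straight apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    found = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            found.add(" ".join(tokens[i:i + n]))
    return found


def paper_counts(papers, min_papers=2):
    """For each n-gram, count how many papers contain it (once per paper)."""
    counts = Counter()
    for text in papers:
        # A set is used per paper, so repetition inside a paper is not counted.
        counts.update(ngrams_in_paper(text))
    # min_papers is a hypothetical cutoff for calling a phrase "recurrent".
    return {phrase: c for phrase, c in counts.items() if c >= min_papers}


def density_per_1000(text, phrase_list):
    """Distinct listed phrases found in the paper, per 1,000 words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = ngrams_in_paper(text) & set(phrase_list)
    return 1000 * len(hits) / len(tokens)


# Toy input, invented for illustration only.
papers = [
    "In this paper, I argue that cats are great. In this paper, I argue again.",
    "In this paper, I argue that dogs are better.",
]
shared = paper_counts(papers)
```

In this toy input, "in this paper i argue" occurs twice in the first paper but contributes only one paper to its count, which mirrors the once-per-paper rule.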
Results
The analysis confirmed that a small cluster of phrases dominates the corpus. Table 1 presents the five most frequently borrowed academic phrases and their prevalence within the dataset. The table shows that “in this paper, I argue” appears in 12.4 percent of all papers, while the prevalence figures for the top five expressions sum to 44.06 percent. Because a single paper may contain more than one of these expressions, that sum is an upper bound: at most 44.06 percent of papers, and at least 12.4 percent, contain one or more of them.
| Phrase | Number of Papers | Prevalence (%) |
|---|---|---|
| in this paper, I argue | 620 | 12.40 |
| the purpose of this study | 488 | 9.76 |
| previous research has shown | 430 | 8.60 |
| it is important to note | 365 | 7.30 |
| the results suggest that | 300 | 6.00 |
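The prevalence column in Table 1 follows from the paper counts and the corpus size of five thousand. The arithmetic can be checked with a short sketch; the names `CORPUS_SIZE`, `table_1`, and `prevalence` are illustrative and do not come from the study.

```python
CORPUS_SIZE = 5000  # total papers, as reported in the methodology

# Paper counts copied from Table 1.
table_1 = {
    "in this paper, I argue": 620,
    "the purpose of this study": 488,
    "previous research has shown": 430,
    "it is important to note": 365,
    "the results suggest that": 300,
}


def prevalence(count, total=CORPUS_SIZE):
    """Share of the corpus, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


prevalences = {phrase: prevalence(count) for phrase, count in table_1.items()}

# Summing counts treats a paper that uses two phrases as two papers,
# so this figure is an upper bound on "at least one phrase".
summed_prevalence = prevalence(sum(table_1.values()))
```

The summed figure equals the share of papers containing at least one of the five phrases only if no paper uses two of them; the per-paper overlap is not reported, so the true share lies somewhere between the largest single prevalence and this sum.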
The figures in Table 1 demonstrate the extent to which academic convention shapes repetitive linguistic behavior. The density of MWPs across the full dataset averaged 7.8 phrases per one thousand words, with a median of 6.9 and a standard deviation of 3.1. Writers in the top ten percent of density exceeded fourteen MWPs per one thousand words, and the highest-density segment of the corpus, representing approximately 3.4 percent of all papers, surpassed eighteen per one thousand words. These outlying papers frequently originated in introductory-level writing courses, where students are trained to rely on explicit rhetorical markers such as metadiscourse and signaling phrases.
Disciplinary differences were pronounced. Humanities papers exhibited the highest use of author-centered framing such as “in this paper, I argue,” which appeared in sixteen percent of Humanities submissions. In STEM fields, attribution-oriented expressions such as “previous research has shown” appeared in 10.5 percent of papers, compared with only six percent in the Humanities. On this measure, Social Sciences papers fell between the two extremes but closer to STEM, likely due to the methodological and literature-review orientation typical of the field. Variance analysis showed that phrase frequency varied most widely within STEM papers, with a variance score of 0.91, while Humanities papers displayed the lowest variance, at 0.62. These disciplinary differences suggest distinct rhetorical norms and differing reliance on conventional phrasing.
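The Social Sciences figure for “previous research has shown” is not stated, but it can be backed out from the numbers that are. The sketch below assumes that the reported shares of six percent and 10.5 percent are exact and that the Table 1 count of 430 covers all three disciplines; the variable names are illustrative, and the result is an implied value, not one reported by the study.

```python
# Discipline sizes from the methodology section.
DISCIPLINE_SIZES = {"Humanities": 2000, "STEM": 2000, "Social Sciences": 1000}

# Papers containing "previous research has shown", from Table 1.
TOTAL_WITH_PHRASE = 430

# Reported within-discipline shares, in percent (assumed exact).
reported_share = {"Humanities": 6.0, "STEM": 10.5}

# Convert the reported shares back into paper counts.
known_counts = {
    name: round(DISCIPLINE_SIZES[name] * share / 100)
    for name, share in reported_share.items()
}

# Whatever is left over must come from Social Sciences.
implied_ss_papers = TOTAL_WITH_PHRASE - sum(known_counts.values())
implied_ss_share = 100 * implied_ss_papers / DISCIPLINE_SIZES["Social Sciences"]
```

Under these assumptions, the implied Social Sciences share is ten percent, which is consistent with the statement that the field sits between the Humanities and STEM but closer to STEM.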
The relationship between document length and phrase density was examined using Spearman’s rank correlation coefficient. The coefficient of 0.31 (p < 0.01) indicates a weak-to-moderate positive association, suggesting that longer papers tend to have a higher density of MWPs and not merely more MWPs in absolute terms. Texts exceeding 3,500 words averaged 9.4 MWPs per one thousand words, whereas papers under 1,500 words averaged 7.2. One possible explanation is that the more complex argumentative structures of longer papers encourage repeated reliance on standardized phrasing.
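Spearman’s coefficient is the Pearson correlation computed on ranks, which makes it robust to skewed length distributions. A minimal standard-library sketch follows; the function names and the six data points are hypothetical and are not drawn from the corpus.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical papers: word counts and MWP densities per 1,000 words.
lengths = [900, 1400, 2100, 2600, 3600, 4800]
densities = [7.0, 6.5, 8.1, 7.4, 9.0, 9.6]
rho = spearman(lengths, densities)
```

Because only the ordering matters, any monotonic rescaling of length (for example, taking logarithms) leaves the coefficient unchanged.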
Academic level was another important variable. A linear regression model fitted to a subsample of 600 undergraduate and 600 graduate papers yielded a statistically significant negative coefficient of –0.17 for academic level. Graduate students used fewer MWPs, averaging 6.7 per one thousand words, whereas undergraduates averaged 8.4. These results suggest that developing writers depend more on formulaic language, while more experienced writers may feel freer to shape their phrasing independently of prescribed templates.
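A regression with a single two-level predictor can be illustrated with ordinary least squares. The sketch below assumes a 0/1 coding of academic level (undergraduate = 0, graduate = 1), which the study does not specify; the eight density values are hypothetical, chosen only so that the group means match the reported 8.4 and 6.7.

```python
def ols_slope_intercept(x, y):
    """Simple least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical MWP densities per 1,000 words (not corpus data).
undergrad = [8.9, 7.6, 8.7, 8.4]
grad = [6.2, 7.1, 6.8, 6.7]

# Assumed coding: 0 = undergraduate, 1 = graduate.
level = [0] * len(undergrad) + [1] * len(grad)
density = undergrad + grad

slope, intercept = ols_slope_intercept(level, density)
```

With this coding, the intercept is the undergraduate mean and the slope is the graduate mean minus the undergraduate mean. The study does not say how academic level was coded or whether its coefficient of –0.17 is standardized, so that value is not directly comparable with the slope computed here.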
The distribution of phrase frequencies followed a predictable linguistic pattern. A Zipf-like distribution emerged when phrase frequency was plotted against rank on logarithmic scales, producing a correlation of –0.92. This approximately power-law behavior indicates that a small number of phrases dominate, while hundreds of others appear much less frequently. The top fifty MWPs accounted for approximately seventy-two percent of all phrase occurrences across the corpus. Semantic clustering identified metadiscourse, attribution, interpretive hedging, and methodological reporting as the primary functional groups. Metadiscourse dominated, accounting for thirty-four percent of all occurrences, reflecting students’ reliance on explicit signposting. Humanities papers displayed particularly strong concentration in this cluster, while STEM papers showed the highest concentration of methodological expressions.
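A Zipf-like pattern is usually checked by correlating the logarithm of rank with the logarithm of frequency, since a power law appears as a straight line on those axes. The following is a minimal sketch of that check and of the top-k share calculation; the function names are illustrative, and the frequency list is an idealized, hypothetical example, not the study's data.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def loglog_rank_correlation(frequencies):
    """Correlation between log(rank) and log(frequency); frequencies must be > 0."""
    freqs = sorted(frequencies, reverse=True)
    log_rank = [math.log(r) for r in range(1, len(freqs) + 1)]
    log_freq = [math.log(f) for f in freqs]
    return pearson(log_rank, log_freq)


def top_k_share(frequencies, k):
    """Fraction of all occurrences contributed by the k most frequent phrases."""
    freqs = sorted(frequencies, reverse=True)
    return sum(freqs[:k]) / sum(freqs)


# Hypothetical ideal Zipf curve: frequency proportional to 1 / rank.
ideal = [1200 / rank for rank in range(1, 201)]
ideal_corr = loglog_rank_correlation(ideal)
ideal_top50 = top_k_share(ideal, 50)
```

An ideal power law gives a correlation of –1 on these axes, so a value such as –0.92 would indicate a close but imperfect fit.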
Discussion
The findings illuminate the tension between convention and creativity in student writing. While MWPs provide stability and clarity, their excessive use may diminish rhetorical originality and mask underdeveloped argumentative skills. The high prevalence of metadiscourse and attribution phrases suggests that students lean heavily on learned templates rather than experimenting with more nuanced or individualized rhetorical strategies. Disciplinary differences highlight how academic traditions shape linguistic patterns, and the differences between undergraduate and graduate writing point to the role of experience in developing a distinct academic voice.
Conclusion
This statistical investigation demonstrates that borrowed phrases are deeply embedded in student writing and that their frequency follows identifiable quantitative patterns shaped by discipline, document length, and academic level. Understanding these patterns may help educators better discern when formulaic expressions reflect legitimate disciplinary practice and when they indicate overreliance. Future studies incorporating semantic embedding and syntactic variation analysis may reveal whether reduced phrase borrowing correlates with more sophisticated conceptual development.