Hybrid Methods for Automatic Collocation Extraction in Building a Learners’ Dictionary of Italian

IRIS

The automatic construction of learners’ dictionaries requires robust methods for identifying non-literal word combinations, or collocations, which represent a significant challenge for second-language (L2) learners. This paper addresses the critical initial step of accurately extracting collocation candidates from corpora to build a learner’s dictionary for Italian. The adopted method and the implemented application are significant for learning the Italian language. We present a comparative study of three methodologies for identifying these candidates within a 41.7-million-word Italian corpus: a Part-Of-Speech-based approach, a syntactic dependency-based approach, and a novel Hybrid method that integrates both. The analysis yielded 2,097,595 potential collocations. Results indicate that the Hybrid method achieves superior performance in terms of Recall and Benchmark Match, identifying the most significant portion of candidates, 42.35% of the total. We conducted an in-depth analysis to refine the extracted dataset, calculating multiple statistical metrics for each candidate, which are described in detail in the paper. Such analysis allows for the classification of collocations by relevance, difficulty, and frequency of use, forming the basis for the future development of a high-quality, web-based dictionary tailored to the proficiency levels of Italian learners.

Hybrid Methods for Automatic Collocation Extraction in Building a Learners’ Dictionary of Italian

Perri, Damiano;Gervasi, Osvaldo;Tasso, Sergio;Spina, Stefania;Fioravanti, Irene;Zanda, Fabio;Forti, Luciana

2025-01-01

Abstract

The automatic construction of learners’ dictionaries requires robust methods for identifying non-literal word combinations, or collocations, which represent a significant challenge for second-language (L2) learners. This paper addresses the critical initial step of accurately extracting collocation candidates from corpora to build a learner’s dictionary for Italian. The adopted method and the implemented application are significant for learning the Italian language. We present a comparative study of three methodologies for identifying these candidates within a 41.7-million-word Italian corpus: a Part-Of-Speech-based approach, a syntactic dependency-based approach, and a novel Hybrid method that integrates both. The analysis yielded 2,097,595 potential collocations. Results indicate that the Hybrid method achieves superior performance in terms of Recall and Benchmark Match, identifying the most significant portion of candidates, 42.35% of the total. We conducted an in-depth analysis to refine the extracted dataset, calculating multiple statistical metrics for each candidate, which are described in detail in the paper. Such analysis allows for the classification of collocations by relevance, difficulty, and frequency of use, forming the basis for the future development of a high-quality, web-based dictionary tailored to the proficiency levels of Italian learners.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2025
			
	Parole chiave
	
				Artificial Intelligence, Natural Language Processing, language learning, collocations, part-of-speech, syntactic dependency, L2 learners
			
	Appare nelle tipologie:
	
				1.1 Articolo in rivista

File in questo prodotto:

File	Dimensione	Formato
computers-14-00552.pdf accesso aperto Tipologia: Versione Editoriale (PDF) Licenza: Creative commons Dimensione 980.72 kB Formato Adobe PDF Visualizza/Apri	980.72 kB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.12071/51352

Citazioni

ND

social impact