From PEC to PEC24: a new reference corpus for Italian

Spina S; Zanda F; Fioravanti I
2025-01-01

Abstract

This article introduces the PEC24, an extension of the Perugia corpus, as a new reference corpus for Italian. The update mainly concerned the size of the corpus, which now consists of approximately 47 million tokens, with an addition of over 100,000 texts. The PEC24 maintains the same structure as its predecessor, divided into 10 sections, representing ten different written and spoken genres. In this article, after reviewing the spoken, written, and web corpora available for the Italian language, the internal composition of each section of the corpus will be described, followed by an explanation of how the corpus was annotated. Further, as the PEC24 is available and searchable online, examples of how it can be queried will be illustrated. In conclusion, the PEC24 represents a significant advancement in the panorama of Italian corpora, offering a representative and more comprehensive resource for linguistic research and corpus-based studies.
2025
corpora, Italian, reference corpus, written Italian, spoken Italian, corpus linguistics
Files in this record:
Spina_Zanda_Fioravanti+From+PEC+to+PEC24+a+new+reference+corpus+for+Italian_updatedef..pdf
Access: open access
Type: Publisher's version (PDF)
License: Creative Commons
Size: 675.45 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12071/47768