Erjavec T, Ogrodniczuk M, Osenova P, Ljubešić N, Simov K, Pančur A, Rudolf M, Kopp M, Barkarson S, Steingrímsson S, Çöltekin Ç, de Does J, Depuydt K, Agnoloni T, Venturi G, Pérez MC, de Macedo LD, Navarretta C, Luxardo G, Coole M, Rayson P, Morkevičius V, Krilavičius T, Darģis R, Ring O, van Heusden R, Marx M, Fišer D. The ParlaMint corpora of parliamentary proceedings. LANG RESOUR EVAL 2023; 57:415-448. [PMID: 35125984 PMCID: PMC8807381 DOI: 10.1007/s10579-021-09574-0] [Accepted: 12/20/2021] [Indexed: 11/30/2022]
Abstract
This paper presents the ParlaMint corpora, which contain transcriptions of the sessions of 17 European national parliaments and total half a billion words. The corpora are uniformly encoded, contain rich metadata about 11 thousand speakers, and are linguistically annotated following the Universal Dependencies formalism and with named entities. Samples of the corpora and conversion scripts are available from the project's GitHub repository. The complete corpora are openly available for download via the CLARIN.SI repository, and for on-line exploration and analysis through the NoSketch Engine and KonText concordancers and the Parlameter interface.
Affiliation(s)
- Tomaž Erjavec
  - Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia
- Maciej Ogrodniczuk
  - Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
- Petya Osenova
  - Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, and Sofia University “St. Kl. Ohridski”, Sofia, Bulgaria
- Nikola Ljubešić
  - Department of Knowledge Technologies, Jožef Stefan Institute and Faculty of Computer Science and Informatics, University of Ljubljana, Ljubljana, Slovenia
- Kiril Simov
  - Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Sofia, Bulgaria
- Andrej Pančur
  - Institute for Contemporary History, Ljubljana, Slovenia
- Michał Rudolf
  - Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland
- Matyáš Kopp
  - Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
- Tommaso Agnoloni
  - Institute of Legal Informatics and Judicial Systems CNR-IGSG, Florence, Italy
- Giulia Venturi
  - Institute of Computational Linguistics CNR-ILC, Pisa, Italy
- Maarten Marx
  - Universiteit van Amsterdam, Amsterdam, The Netherlands
- Darja Fišer
  - Arts Faculty, University of Ljubljana, and Institute of Contemporary History, Ljubljana, Slovenia