1
Cammarota V, Bozza S, Roten CA, Taroni F. Stylometry and forensic science: A literature review. Forensic Sci Int Synerg 2024; 9:100481. [PMID: 39781110 PMCID: PMC11707938 DOI: 10.1016/j.fsisyn.2024.100481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2024] [Revised: 05/30/2024] [Accepted: 06/07/2024] [Indexed: 01/12/2025]
Abstract
The article focuses on a careful description of literature on stylometry and on its potential use in forensic science. The state of the art of stylometry is summarized to illustrate the history and the scientific foundation of this discipline. However, the study conducted reveals that there are still some key unresolved aspects that require a response from the academic world. The paper introduces the readers to those issues that need to be tackled for stylometry to be accepted as a forensic discipline. In particular, a coherent probabilistic procedure to assess the probative value of the results obtained through this methodology is largely absent. This gap should be filled properly by applying criteria recommended by international organizations such as the European Network of Forensic Science Institutes. Solutions do exist and will allow a better integration of stylometry in forensic science, favouring the acceptance of this scientific technical method in judicial proceedings.
Affiliation(s)
- Silvia Bozza
  - School of Criminal Justice, University of Lausanne, Lausanne, Switzerland
  - Department of Economics, Ca’ Foscari University of Venice, Venice, Italy
- Franco Taroni
  - School of Criminal Justice, University of Lausanne, Lausanne, Switzerland
3
Andreas J, Beguš G, Bronstein MM, Diamant R, Delaney D, Gero S, Goldwasser S, Gruber DF, de Haas S, Malkin P, Pavlov N, Payne R, Petri G, Rus D, Sharma P, Tchernov D, Tønnesen P, Torralba A, Vogt D, Wood RJ. Toward understanding the communication in sperm whales. iScience 2022; 25:104393. [PMID: 35663036 PMCID: PMC9160774 DOI: 10.1016/j.isci.2022.104393] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/01/2022]
Abstract
Machine learning has been advancing dramatically over the past decade. Most strides are human-based applications due to the availability of large-scale datasets; however, opportunities are ripe to apply this technology to more deeply understand non-human communication. We detail a scientific roadmap for advancing the understanding of communication of whales that can be built further upon as a template to decipher other forms of animal and non-human communication. Sperm whales, with their highly developed neuroanatomical features, cognitive abilities, social structures, and discrete click-based encoding make for an excellent model for advanced tools that can be applied to other animals in the future. We outline the key elements required for the collection and processing of massive datasets, detecting basic communication units and language-like higher-level structures, and validating models through interactive playback experiments. The technological capabilities developed by such an undertaking hold potential for cross-applications in broader communities investigating non-human communication and behavioral research.
Affiliation(s)
- Jacob Andreas
  - MIT CSAIL, Cambridge, MA, USA
  - Project CETI, New York, NY, USA
- Gašper Beguš
  - Department of Linguistics, University of California, Berkeley, CA, USA
  - Project CETI, New York, NY, USA
- Michael M. Bronstein
  - Department of Computer Science, University of Oxford, Oxford, UK
  - IDSIA, University of Lugano, Lugano, Switzerland
  - Twitter, London, UK
  - Project CETI, New York, NY, USA
- Roee Diamant
  - Leon H. Charney School of Marine Sciences, University of Haifa, Haifa, Israel
  - Project CETI, New York, NY, USA
- Denley Delaney
  - Exploration Technology Lab, National Geographic Society, Washington DC, USA
  - Project CETI, New York, NY, USA
- Shane Gero
  - Dominica Sperm Whale Project, Roseau, Commonwealth of Dominica
  - Department of Biology, Carleton University, Ottawa, ON, Canada
  - Project CETI, New York, NY, USA
- Shafi Goldwasser
  - Simons Institute for the Theory of Computing, University of California, Berkeley, CA, USA
- David F. Gruber
  - Department of Natural Sciences, Baruch College and The Graduate Center, PhD Program in Biology, City University of New York, New York, NY, USA
  - Project CETI, New York, NY, USA
- Sarah de Haas
  - Google Research, Mountain View, CA, USA
  - Project CETI, New York, NY, USA
- Peter Malkin
  - Google Research, Mountain View, CA, USA
  - Project CETI, New York, NY, USA
- Giovanni Petri
  - ISI Foundation, Turin, Italy
  - Project CETI, New York, NY, USA
- Daniela Rus
  - MIT CSAIL, Cambridge, MA, USA
  - Project CETI, New York, NY, USA
- Dan Tchernov
  - Leon H. Charney School of Marine Sciences, University of Haifa, Haifa, Israel
  - Project CETI, New York, NY, USA
- Pernille Tønnesen
  - Marine Bioacoustics Lab, Zoophysiology, Department of Biology, Aarhus University, Aarhus, Denmark
  - Project CETI, New York, NY, USA
- Daniel Vogt
  - School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
  - Project CETI, New York, NY, USA
- Robert J. Wood
  - School of Engineering and Applied Sciences, Harvard University, Cambridge, MA, USA
  - Project CETI, New York, NY, USA
4
Ryabko B, Savina N. Using Data Compression to Build a Method for Statistically Verified Attribution of Literary Texts. Entropy 2021; 23:e23101302. [PMID: 34682026 PMCID: PMC8534409 DOI: 10.3390/e23101302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2021] [Revised: 09/29/2021] [Accepted: 09/29/2021] [Indexed: 11/16/2022]
Abstract
We consider problems of the authorship of literary texts in the framework of the quantitative study of literature. This article proposes a methodology for authorship attribution of literary texts based on the use of data compressors. Unlike other methods, the suggested one makes it possible to obtain statistically verified results. This method is used to solve two problems of attribution in Russian literature.
Affiliation(s)
- Boris Ryabko
  - Federal Research Center for Information and Computational Technologies of SB RAS, 630090 Novosibirsk, Russia
  - Department of Information Technologies, Novosibirsk State University, 630090 Novosibirsk, Russia
- Nadezhda Savina
  - Department of Information Technologies, Novosibirsk State University, 630090 Novosibirsk, Russia
6
Vázquez PP. Visual Analysis of Research Paper Collections Using Normalized Relative Compression. Entropy 2019; 21:e21060612. [PMID: 33267326 PMCID: PMC7515106 DOI: 10.3390/e21060612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 05/10/2019] [Revised: 06/15/2019] [Accepted: 06/19/2019] [Indexed: 11/16/2022]
Abstract
The analysis of research paper collections is an interesting topic that can give insights on whether a research area is stalled on the same problems or produces a great amount of novelty every year. Previous research has addressed similar tasks by the analysis of keywords or reference lists, with different degrees of human intervention. In this paper, we demonstrate how, with the use of Normalized Relative Compression, together with a set of automated data-processing tasks, we can successfully visually compare research articles and document collections. We also achieve very similar results with Normalized Conditional Compression, which can be applied with a regular compressor. With our approach, we can group papers of different disciplines, analyze how a conference evolves throughout the different editions, or how the profile of a researcher changes over time. We provide a set of tests that validate our technique, and show that it behaves better for these tasks than other techniques previously proposed.
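As a rough illustration of how a conditional compression measure can be approximated with a regular compressor, the following Python sketch uses zlib. It is not the implementation from the paper: the approximation C(x|y) ≈ C(yx) - C(y), the normalization by C(x), and the names `csize`, `conditional_size`, and `ncc` are all assumptions made for this example, and the two sample sentences are invented.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes, using zlib at its highest compression level."""
    return len(zlib.compress(data, 9))


def conditional_size(x: bytes, y: bytes) -> int:
    """Assumed approximation of C(x|y): extra bytes needed for x once y is known.

    Computed as C(yx) - C(y); this is a common workaround for compressors
    that offer no native conditional mode.
    """
    return csize(y + x) - csize(y)


def ncc(x: str, y: str) -> float:
    """Assumed normalization: conditional size of x given y, divided by C(x)."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    return conditional_size(bx, by) / csize(bx)


# Invented sample texts, for illustration only.
doc_a = "Normalized compression measures compare documents by how well one helps to compress the other."
doc_b = "Sperm whales exchange sequences of clicks called codas within their social units in the ocean."

print("ncc(a | a) =", ncc(doc_a, doc_a))
print("ncc(a | b) =", ncc(doc_a, doc_b))
```

Lower values suggest that the reference text already contains much of the information in the target text. Because DEFLATE, the algorithm behind zlib, only looks back 32 KiB, references longer than that would need a compressor with a larger window.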
Affiliation(s)
- Pere-Pau Vázquez
  - ViRVIG Group, Universitat Politècnica de Catalunya, 08034 Barcelona, Spain
7
Claude F, Galaktionov D, Konow R, Ladra S, Pedreira Ó. Competitive Author Profiling Using Compression-Based Strategies. INT J UNCERTAIN FUZZ 2017. [DOI: 10.1142/s0218488517400086] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022]
Abstract
Author profiling consists in determining some demographic attributes (such as gender, age, nationality, language, religion, and others) of an author for a given document. This task, which has applications in fields such as forensics, security, or marketing, has been approached from different areas, especially from linguistics and natural language processing, by extracting different types of features from training documents, usually content- and style-based features. In this paper we address the problem by using several compression-inspired strategies that generate different models without analyzing or extracting specific features from the textual content, making them style-oblivious approaches. We analyze the behavior of these techniques, combine them and compare them with other state-of-the-art methods. We show that they can be competitive in terms of accuracy, giving the best predictions for some domains, and they are efficient in time performance.
Affiliation(s)
- Francisco Claude
  - Universidad Diego Portales, Escuela de Informática y Telecomunicaciones, Santiago, Chile
  - Sudo Technologies Inc., Menlo Park, California, USA
- Daniil Galaktionov
  - Universidade da Coruña, Database Laboratory, Elviña, 15071, A Coruña, Spain
- Roberto Konow
  - Universidad Diego Portales, Escuela de Informática y Telecomunicaciones, Santiago, Chile
  - eBay Inc., San Jose, California, USA
- Susana Ladra
  - Universidade da Coruña, Database Laboratory, Elviña, 15071, A Coruña, Spain
- Óscar Pedreira
  - Universidade da Coruña, Database Laboratory, Elviña, 15071, A Coruña, Spain
8
Coutinho DP, Figueiredo MAT. Text Classification Using Compression-Based Dissimilarity Measures. INT J PATTERN RECOGN 2015. [DOI: 10.1142/s0218001415530043] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Indexed: 11/18/2022]
Abstract
Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering, by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approximate, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler, in the sense that they do not require any text pre-processing or feature engineering.
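The pipeline described in this abstract (map each text to a vector of dissimilarities, then apply a classical classifier) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' method: the normalized compression distance computed with zlib stands in for the paper's dissimilarity measures, the training texts double as prototypes, the classifier is a plain 1-nearest-neighbor rule, and the toy texts, labels, and function names are invented for the example.

```python
import zlib


def csize(data: bytes) -> int:
    """Compressed size in bytes, using zlib at its highest compression level."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = csize(bx), csize(by)
    return (csize(bx + by) - min(cx, cy)) / max(cx, cy)


def embed(text: str, prototypes: list[str]) -> list[float]:
    """Dissimilarity representation: one NCD value per prototype text."""
    return [ncd(text, p) for p in prototypes]


def nearest_label(text: str, train: list[tuple[str, str]], prototypes: list[str]) -> str:
    """1-nearest-neighbor in the dissimilarity space (squared Euclidean distance)."""
    v = embed(text, prototypes)

    def sqdist(item: tuple[str, str]) -> float:
        w = embed(item[0], prototypes)
        return sum((a - b) ** 2 for a, b in zip(v, w))

    return min(train, key=sqdist)[1]


# Invented toy data: (text, label) pairs; the texts also serve as prototypes.
train = [
    ("The ship sailed across the stormy sea and the captain watched the waves "
     "break over the deck until the harbor lights appeared.", "sea"),
    ("The captain steered the ship through the stormy sea while the waves "
     "crashed on the deck and the harbor was still far away.", "sea"),
    ("Investors sold their shares when the stock market fell and bond prices "
     "rose after the central bank raised interest rates.", "finance"),
    ("The stock market rallied and investors bought shares again after the "
     "central bank cut interest rates and bond yields fell.", "finance"),
]
prototypes = [text for text, _ in train]
query = ("The waves crashed over the deck as the ship sailed through the "
         "stormy sea toward the harbor lights.")

print(nearest_label(query, train, prototypes))
```

No tokenization or feature selection happens anywhere in the sketch; the compressor alone decides which texts look alike. On texts this short, zlib's fixed overhead makes the distances noisy, so realistic use would involve much longer documents.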
Affiliation(s)
- David Pereira Coutinho
  - Instituto de Telecomunicações and Instituto Superior de Engenharia de Lisboa (ISEL), Instituto Politécnico de Lisboa, 1959-007 Lisboa, Portugal
- Mário A. T. Figueiredo
  - Instituto de Telecomunicações and Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal