This sub-project aims to facilitate the use and reuse of data of legal or medico-legal origin by developing pseudonymisation protocols and tools.
To date, it follows two experimental paths.
In the first path, automatic protocols will be identified and tested to:
- identify the words or multiword expressions (MWs) to be obscured or deleted (tax codes and names, including those of lawyers);
- annotate the MWs with grammatical tags;
- overwrite them with OMISSIS and record the entity type (tax code, name) in the lemma field;
- re-export the reconstructed corpus;
- import the texts into a new, pseudonymised corpus.
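The identify-and-overwrite steps above can be illustrated with a minimal Python sketch. It is not the TALTAC procedure: the regular expression assumes the 16-character Italian tax code layout, the list of names is supplied by hand, and `mask_entities` and its log (a stand-in for writing the type into the lemma) are hypothetical names introduced only for this example.

```python
import re

# Assumption: Italian tax code layout (6 letters, 2 digits, letter,
# 2 digits, letter, 3 digits, letter). Not taken from the project itself.
TAX_CODE = re.compile(r"\b[A-Z]{6}\d{2}[A-Z]\d{2}[A-Z]\d{3}[A-Z]\b")


def mask_entities(text, names):
    """Replace tax codes and the given names with OMISSIS.

    Returns the masked text and a log of (original form, entity type)
    pairs, a hypothetical stand-in for recording the type in the lemma.
    """
    log = []

    def replace(kind):
        def _sub(match):
            log.append((match.group(0), kind))
            return "OMISSIS"
        return _sub

    masked = TAX_CODE.sub(replace("tax code"), text)
    # Longest names first, so a multiword name is matched before its parts.
    for name in sorted(names, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b")
        masked = pattern.sub(replace("name"), masked)
    return masked, log


text = "Avv. Mario Rossi (RSSMRA80A01H501U) appeared for the claimant."
masked, log = mask_entities(text, ["Mario Rossi"])
print(masked)  # Avv. OMISSIS (OMISSIS) appeared for the claimant.
```

Keeping the entity type in a separate log, rather than in the text, mirrors the idea of overwriting the form while preserving its typology for later analysis.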
The procedures will take advantage of the spelling correction functions already present in the TALTAC software. Their reliability will be tested on a random sample of judgments, against a target error rate of less than 5%.
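As a rough illustration of the reliability check, the sketch below draws a random sample of judgments and computes the share that a reviewer marks as wrongly processed. The function name `sample_error_rate`, the `is_correct` callback and the toy data are assumptions made for this example. How an "error" is defined (per judgment or per entity) is not specified here, so the sketch simply counts judgments.

```python
import random


def sample_error_rate(judgments, is_correct, sample_size, seed=0):
    """Estimate the error rate on a random sample of judgments.

    `is_correct` is a hypothetical callback standing in for manual review:
    it returns True when a judgment was masked without errors.
    """
    if not judgments:
        raise ValueError("no judgments to sample from")
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(judgments, min(sample_size, len(judgments)))
    errors = sum(1 for judgment in sample if not is_correct(judgment))
    return errors / len(sample)


# Toy data: judgment ids 0..199, of which every 50th is flagged as faulty.
judgments = list(range(200))
rate = sample_error_rate(judgments, lambda j: j % 50 != 0, sample_size=50)
print(f"estimated error rate: {rate:.1%}; below 5%: {rate < 0.05}")
```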
Supervised Machine Learning
The second path of experimentation involves the development of Named Entity Recognition (NER) algorithms based on Deep Learning.
The model will be built in the following steps:
- definition of the anonymization protocol;
- selection of a representative sample of legal documents;
- annotation of the entities to be anonymized through a web application;
- development of the NER model (based on NLP libraries);
- evaluation of its performance in terms of accuracy, precision and recall.
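The evaluation step can be sketched as follows, assuming that entities are compared as exact (start, end, label) spans, which is one common convention but not necessarily the one the project will adopt. The function name and the example spans are hypothetical. Accuracy is left out because it is usually computed per token and needs the full token sequence.

```python
def precision_recall(gold, predicted):
    """Entity-level precision and recall with exact span matching.

    Both arguments are collections of (start, end, label) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical annotations for one document.
gold = {(5, 16, "NAME"), (18, 34, "TAX_CODE"), (60, 72, "NAME")}
pred = {(5, 16, "NAME"), (18, 34, "TAX_CODE"), (40, 48, "NAME")}
p, r = precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```

For anonymization, recall is arguably the more critical figure, since a missed entity leaves personal data in the text.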
The entities automatically identified by the final NER model can then be anonymized and/or pseudonymized and recorded in the project's semantic data format (LS-JSON).
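A minimal sketch of the pseudonymisation step is given below, assuming the NER model returns non-overlapping character spans. Each distinct entity receives a stable placeholder, so repeated mentions stay linked. The JSON record at the end is a generic illustration with hypothetical field names; it does not reproduce the LS-JSON schema, which is not described here.

```python
import json


def pseudonymise(text, entities):
    """Replace (start, end, label) spans with stable placeholders.

    The same surface form with the same label always maps to the same
    placeholder, e.g. every mention of one name becomes [NAME_1].
    """
    aliases = {}   # (label, surface form) -> placeholder
    counters = {}  # label -> number of distinct entities seen so far
    parts, cursor = [], 0
    for start, end, label in sorted(entities):
        key = (label, text[start:end])
        if key not in aliases:
            counters[label] = counters.get(label, 0) + 1
            aliases[key] = f"[{label}_{counters[label]}]"
        parts.append(text[cursor:start])
        parts.append(aliases[key])
        cursor = end
    parts.append(text[cursor:])
    return "".join(parts), aliases


text = "Mario Rossi sued Anna Bianchi. Mario Rossi appealed."
entities = [(0, 11, "NAME"), (17, 29, "NAME"), (31, 42, "NAME")]
new_text, aliases = pseudonymise(text, entities)
print(new_text)  # [NAME_1] sued [NAME_2]. [NAME_1] appealed.

# Hypothetical record layout, not the LS-JSON schema.
record = json.dumps({"text": new_text,
                     "placeholders": sorted(aliases.values())})
```

The mapping in `aliases` is deliberately kept out of the exported record: storing it alongside the text would allow re-identification.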