A Modelling Language for Discourse Analysis in the Humanities: Definition, Design, Validation and First Experiences

Patricia Martin-Rodilla
patricia.martin-rodilla@incipit.csic.es
Institute of Heritage Sciences (Incipit), Spanish National Research Council (CSIC)
Cesar Gonzalez-Perez
cesar.gonzalez-perez@incipit.csic.es
Institute of Heritage Sciences (Incipit), Spanish National Research Council (CSIC)

Long paper
Discourse processing


Introduction and Background

Because the humanities generally produce knowledge in textual formats, such as narrative conclusions or reports, proper management of humanities corpora requires methods for extracting information from textual sources.

Software engineering approaches to conceptualizing and extracting information from textual sources go back decades. Initially, most approaches were related to information retrieval [1], extracting information at a quantitative level. For instance, heuristic and probabilistic techniques extract frequency or similar indicators about specific elements in a text. We can also find semantic approaches within information retrieval, analysing textual sources based on topic maps [2], thesauri, lexematization techniques [3] or sentiment analysis [4][5]. These approaches allow extracting semantic relationships between elements, such as hierarchical relationships. However, due to the degree of automation applied, they cannot achieve a satisfactory level of semantic extraction for application to more narrative contexts.
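The quantitative indicators mentioned above can be as simple as term frequencies. A minimal, illustrative sketch (not any specific system cited here):

```python
# Minimal sketch of a quantitative indicator: term frequencies in a text.
# Purely illustrative; real IR systems add tokenization, stemming, weighting, etc.
from collections import Counter
import re

def term_frequencies(text: str) -> Counter:
    """Count lowercase word tokens in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tokens)

freqs = term_frequencies("Few examples survive with both rim and base intact. "
                         "Instead, the bases most commonly survive.")
print(freqs.most_common(3))  # the most frequent tokens in the fragment
```

Indicators like these capture how often something is mentioned, but, as noted above, not the semantic or inferential relations between mentions.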

Secondly, there are approaches focused on modelling a specific domain in order to achieve the desired conceptualization and semantic extraction from textual sources. For instance, existing applications in biomedicine [6] combine conceptual modelling, annotation and natural language processing methods. These approaches are applied successfully by designing ad hoc information extraction methods for a particular domain or corpus. However, an ad hoc design makes the resulting solution domain-dependent, preventing a high degree of generalization.

Currently, more linguistic and semantic approaches are becoming popular, enriching information extraction methods from textual sources with an acceptable degree of domain independence. In particular, discourse analysis [7] is a set of techniques from linguistics used to discover semantic relations between elements in texts based on their discursive structure. In other words, by applying discourse analysis we can identify what discourse elements are present in a text (sentences, clauses, etc.) and link them to the entities of the reality referred to. In addition, we can identify what inferential relations connect those parts (causal relations, exemplifications, etc.). This connection between discourse structure and elements of the reality referred to in the text, as well as the inferential dimension, constitutes semantic information that is not available through other text extraction methods.

For this reason, current studies [8][9] are working on the application of discourse analysis techniques for extracting information from textual sources. Hence, we based our work on the approach by Hobbs [10] and subsequent work building on it, and defined a modelling language that allows applying discourse analysis to extract information from textual sources in the humanities. This language was previously presented in [11]. In the next section we introduce the modelling language. In later sections, we present for the first time the validation of the modelling language and the results obtained.

The Modelling Language

In order to provide the necessary method to extract structural information from texts (not only at a quantitative level, but also at a highly semantic and inferential level), and based on successful experiences in teaching conceptual modelling to humanities specialists, we carried out a two-year research project on the application of discourse analysis to textual sources in the humanities. We have created a conceptual language that allows creating models from textual sources [11], capturing the discourse structure and extracting semantic and inferential information from them.

Let us take an example discourse fragment, extracted from an archaeological and historical study in Cyprus [12], available on-line [13]:

Few examples survive with both rim and base intact. Instead, the bases most commonly survive possibly because rims and sides tended to be finer and taller than those of other types and hence more fragile.

Following Hobbs' discourse analysis techniques, the fragment is divided into two parts, separated by the link "Instead". The first sentence is already an atomic element. The second part needs to be further divided into smaller sentences: "Instead, the bases most commonly survive" and "possibly because rims and sides tended to be finer and taller than those of other types and hence more fragile". The modelling language allows us to create an object model, identifying the entities that the discourse talks about (parts of material evidence), their features and relationships. In addition, the language allows modelling the Explanation coherence relation (in Hobbs' terms) existing between these last two sentences: the second one explains the cause of the first. The modelling process continues with the creation of the complete model, involving the first atomic sentence "Few examples survive with both rim and base intact". At this point we can identify the Contrast relation (in Hobbs' terms) between this first atomic sentence and sentence S2a in Figure 1.

Figure 1.

The modelling language allows us to capture in our model all the information about entities, features, relationships and coherence relations present in the discourse. An example of use could be to improve software search methods over material evidence, so that we can find out which parts are preserved and which are not, and relate them to their height and fragility.
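The kind of model just described (segments, the entities they refer to, and the Hobbs coherence relations linking them) can be sketched as a simple data structure. This is a hypothetical illustration only: the class names and fields below are assumptions, not the actual ISO/IEC 24744-derived metamodel of [11].

```python
# Hypothetical sketch of a discourse model for the Cyprus fragment:
# segments, referenced entities, and Hobbs coherence relations.
# Names and structure are illustrative, not the language's real metamodel.
from dataclasses import dataclass, field

@dataclass
class Segment:
    sid: str                  # segment id, e.g. "S1", "S2a", "S2b"
    text: str
    entities: list = field(default_factory=list)  # entities referred to

@dataclass
class CoherenceRelation:
    kind: str                 # "Explanation", "Contrast", ... (Hobbs terms)
    source: Segment
    target: Segment

s1  = Segment("S1", "Few examples survive with both rim and base intact.",
              ["rim", "base"])
s2a = Segment("S2a", "Instead, the bases most commonly survive", ["base"])
s2b = Segment("S2b", "possibly because rims and sides tended to be finer and "
              "taller than those of other types and hence more fragile",
              ["rim", "side"])

model = [
    CoherenceRelation("Explanation", s2b, s2a),  # S2b explains S2a
    CoherenceRelation("Contrast", s1, s2a),      # S1 contrasts with S2a
]

# Example search: which material-evidence parts take part in an Explanation?
explained = {e for r in model if r.kind == "Explanation"
               for e in r.source.entities + r.target.entities}
print(sorted(explained))
```

A query like the last one hints at the improved search methods mentioned above: once coherence relations are explicit, searches can follow inferential links rather than mere co-occurrence.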

Language validation and results

Using the proposed language, we modelled a selected corpus of texts from historical and archaeological contexts in order to validate the approach.

The validation process included an interview phase with the authors of the texts, following discourse analysis recommendations, and a group of sessions based on Think Aloud Protocols (TAP) [14]. TAP establishes recorded sessions with real users (in our case, humanities specialists), in which the users "think aloud", identifying the cognitive processes they perform in response to the tasks presented in each session. The purpose of the interviews and TAP sessions was to investigate what areas of conflict exist in both cases.

The validation process has allowed us to extract inferential information from texts, such as the detection of contrasts, causality, or generalization and exemplification relations. Such identification would not be possible using existing approaches from software engineering. In addition, the validation results allowed us to assess the generalization possibilities of the models created, as well as which inferences presented a higher level of disagreement. For example, we identified generalization relations for which there was a high degree of disagreement between the models made by the author of the text and the models made by researchers in the same discipline. Depending on whether the researcher considered a given example relevant enough to serve as a basis for generalization, he or she agreed or not.
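The disagreement observed between the author's models and those of other researchers could be quantified, for instance, as simple percent agreement over the coherence relation labels assigned to the same discourse links. The study's validation relied on interviews and TAP sessions; the metric below is only a hypothetical illustration, and the label sequences are invented.

```python
# Hypothetical sketch: percent agreement between two analysts who labelled
# the same discourse links with Hobbs coherence relations.
# Invented data; the study's actual validation used interviews and TAP sessions.
def percent_agreement(labels_a, labels_b):
    """Fraction of positions where both analysts assigned the same label."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

author     = ["Explanation", "Contrast", "Generalization", "Exemplification"]
researcher = ["Explanation", "Contrast", "Exemplification", "Exemplification"]
print(percent_agreement(author, researcher))  # disagreement on one of four links
```

In this invented example the analysts diverge exactly where the text could be read either as a generalization or as a mere exemplification, mirroring the pattern of disagreement described above.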

Conclusions and future work

The modelling language created allows us to extract information from humanities texts: not only what entities are referenced in a text and what its discourse structure is, but also what inferences and underlying argumentation are used and how they are connected. Furthermore, we have presented here empirical results about the degree of agreement and possible generalization within the community related to a particular corpus. These results can be used in subsequent steps in many ways. Particularly relevant for future work are: (1) the analysis of new corpora, which will allow us to implement mechanisms to detect inconsistencies and other functionalities presented above, and will encourage self-reflection within the disciplines of the corpora analysed, teaching us more about how knowledge is generated using narrative formats in the humanities; this information about the knowledge generation process is crucial for the development of software systems for the humanities; (2) the application of the extracted information to the expansion and improvement of existing annotation systems, including inferential information, which will enrich corpus analysis; and (3) the detection of relations between entities and underlying inferences, as an initial step towards studying the potential of knowledge discovery and data mining in humanities texts. For example, the detection of hidden causalities in texts opens up the application of existing semi-automatic data-mining methods based on causal mechanisms, such as association rules [15]. This connection has already been pointed out in previous studies [16], although it is still at an early stage of development.
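The association-rule connection in point (3) can be made concrete with the standard support and confidence measures of [15]. The sketch below is a hypothetical illustration on invented data: each "transaction" is the set of entities extracted from one analysed fragment, and a rule's confidence suggests a candidate regularity worth examining for causality.

```python
# Hypothetical sketch of association-rule mining over extracted entities.
# Each set lists entities mentioned together in one analysed fragment.
# Data is invented for illustration only.
fragments = [
    {"rim", "base", "fragile"},
    {"rim", "fragile"},
    {"base"},
    {"rim", "fragile", "tall"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Candidate rule: fragments mentioning "rim" also tend to mention "fragile".
print(confidence({"rim"}, {"fragile"}, fragments))  # 1.0 on this toy data
```

High-confidence rules of this kind would then be candidates for the hidden causalities mentioned above, to be confirmed (or rejected) against the coherence relations in the discourse models.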

References

[1] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., 1999.

[2] ISO/IEC. Topic Maps. ISO/IEC 13250:2006. 2006.

[3] J. M. Torres Moreno. "Reagrupamiento en familias y lexematización automática independientes del idioma". Inteligencia Artificial, Revista Iberoamericana de Inteligencia Artificial, vol. 14, pp. 38-53, 2010.

[4] B. Pang, L. Lee and S. Vaithyanathan. "Thumbs up? Sentiment Classification using Machine Learning Techniques". Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 79–86. 2002.

[5] D. Borth, R. Ji, T. Chen, T. Breuel and S. Chang. "Large-scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs". Proceedings of the 21st ACM International Conference on Multimedia, pp. 223-232. doi:10.1145/2502081.2502282. 2013.

[6] L. J. Jensen, J. Saric and P. Bork. "Literature mining for the biologist from information retrieval to biological discovery". Nature Reviews Genetics, vol.7, pp. 119-129, 2006.

[7] Z. S. Harris. "Discourse analysis: A sample text". Language, vol. 28, pp. 474-494, 1952.

[8] L. Polanyi. “A formal model of the structure of discourse”. Journal of Pragmatics, vol. 12, pp. 601-638, 1988.

[9] P. Mc Kevitt, D. Partridge and Y. Wilk. "Why machines should analyse intention in natural language dialogue". International Journal of Human-Computer Studies, vol. 51, pp. 947-989, 1999.

[10] J. R. Hobbs. On the Coherence and Structure of Discourse. Technical Report. Stanford, CA: Center for the Study of Language and Information. 1985.

[11] P. Martín-Rodilla and C. Gonzalez-Perez. "An ISO/IEC 24744-derived modelling language for discourse analysis". Proceedings of the IEEE Eighth International Conference on Research Challenges in Information Science, pp. 1-10. 2014.

[12] E. Peltenburg. “The Colonisation and Settlement of Cyprus. Investigations at Kissonerga Mylouthkia, 1976-1996”. Åström Verlag, Sävedalen. 2003.

[13] E.I. Peltenburg. Kissonerga-Mylouthkia, Cyprus 1976-1996 [data-set]. York: Archaeology Data Service [distributor] doi:10.5284/1000051.

[14] M.W. Van Someren, Y.F. Barnard and J.A. Sandberg. The think aloud method: A practical guide to modelling cognitive processes (Vol. 2). London: Academic Press. 1994.

[15] R. Agrawal, T. Imieliński and A. Swami. "Mining association rules between sets of items in large databases". Proceedings of the ACM SIGMOD international conference on Management of data - SIGMOD '93. doi:10.1145/170035.170072. 1993.

[16] P. Martín-Rodilla. Software-Assisted Knowledge Generation in the Archaeological Domain: A Conceptual Framework. CAiSE (Doctoral Consortium). 2013.