The accuracy of author attribution of a text is not sufficiently high at the lexical and syntactic levels of a language, because these levels are not strictly organized systems. In this study, author attribution of a text is based on the differentiation of phonostatistic structures of styles.
A system for differentiating phonostatistic structures of styles has been developed. It differs from existing systems in the chosen level of language, the phonological one. At this level, results can be obtained with greater accuracy. In addition, the system is built on a modular principle, which makes it possible to modify the developed software product quickly.
Methods and models have been developed that are based on the theory of mathematical statistics and make it possible to improve the accuracy of differentiation of phonostatistic structures of styles. A method of complex analysis of phonostatistic structures of styles has been constructed, together with a multifactor method for determining the degrees of action of the factors of style, substyle, and the author's manner of presentation. A statistical model of style differentiation by the ranking method and a statistical model for determining the general stylistic markedness of the examined text have been built. A software system for text differentiation has been developed.
The criterion for text differentiation is the mean frequencies of groups of consonant phonemes. The system is implemented in the Java programming language, which ensures that the software product is platform-independent.
Results of applying the developed methods, models, and software tools are given; they confirm that author attribution of a text is more effective at the phonological level.
The developed methods, models, and tools for author attribution of a text can be used to establish the percentage of creative contribution of each co-author of a scientific paper.
Keywords: mean frequencies of groups of consonant phonemes, style, substyle, and author differentiation of texts, software system, method, phoneme, phonological level
UDC 519.711; 681.5; 621.382
DOI: 10.15587/1729-4061.2018.1320521
DEVELOPMENT OF METHODS, MODELS, AND MEANS FOR THE AUTHOR ATTRIBUTION OF A
I. Khomytska
Assistant
Department of Applied Linguistics*
V. Teslyuk
Doctor of Technical Sciences, Professor
Department of Automated Control Systems*
E-mail: vasyl.m.teslyuk@lpnu.ua
A. Holovatyy
PhD, Associate Professor
Department of Information Technologies
Ukrainian National Forestry University
Henerala Chuprynky str., 103, Lviv, Ukraine, 79057
O. Morushko
PhD, Associate Professor
Department of Social Communication and Information Activity*
*Lviv Polytechnic National University
Bandery str., 12, Lviv, Ukraine, 79013
Modern information technologies (IT) are widely used in various fields of science and technology. One such area is applied linguistics [1, 2], where IT has been applied to author attribution using content analysis [3], to the attribution of texts in legal proceedings [4, 5], and to the linguistic analysis of commercial text content [6]. IT is also employed in the semantic analysis of Ukrainian texts [7] and in research related to distance learning programs [8]. A system of algorithmic algebra has been used in grammatical analysis for the categorization of text documents and in determining an author's style [9-11].
Under conditions of globalization, the task of identifying the authorship of texts is of particular importance. Our analysis of the subject area reveals that, in most cases, the differentiation of texts in the process of establishing the author of a text, as well as checking a text for plagiarism, has relied on methods, models, and software tools that operate at the lexical level of a language. However, the phonological level differs from the other levels of a language in its stricter structure and ordering of elements, so it is easier to formalize and treat mathematically. It is therefore advisable to apply methods, models, and software tools at the phonological level of a language in order to identify the author of a text and to check the text for plagiarism. Accordingly, the development of methods, models, and tools that enable IT-based differentiation of phonostatistic structures of functional styles in the English language is an important and relevant task.
The task of identifying the authorship of a text implies the differentiation of texts. Texts are differentiated at different levels of a language in order to identify their differences and similarities. Thus, texts were differentiated at the lexical level when modeling grammatical structures [12]. However, the lexical level is an open system: the number of elements is not constant, the system is updated with new words (neologisms), and rarely used words become archaic. An author's style reflects these changeable processes in the lexical system, so the identification of authorship at the lexical level is probabilistic in character. It is worth noting that grammatical structures are abstract, idealized models that do not fully reflect the speech process, which makes it difficult to define the differential attributes of an author's style. Modeling of semantic structures has also been used for text differentiation [13]. Semantic structures are abstract constructs whose implementation depends on the context, which is why a focus on semantics predetermines the probabilistic character of author attribution. Texts are differentiated at the lexical and semantic levels when a sentence is split into key words [14]. Determining the dominant lexical units was used to distinguish texts in the areas of culture and tourism [15, 16]. However, determining the dominant key words does not cover the vocabulary characteristic of a particular author and is therefore not promising for identifying an author's style. It should be noted that the results of text differentiation at the lexical and semantic levels of a language are more probabilistic in character than those at the phonological level: in contrast to the phonological level, the number of elements is not constant, which compromises the accuracy of calculations.
In addition, no combination of the most effective quantitative methods has been determined for differentiating texts at each level of a language [17]. When establishing the differential attributes of an author's style using statistical methods, the scheme style → substyle → author, which facilitates determining statistical parameters for the author's manner of presentation in texts on different subjects, was not applied [18]. Information technologies were not employed for author attribution at the phonological level, which does not provide the proper level of accuracy [19, 20]. Software systems do not implement a combination of statistical methods that would make author attribution efficient [21]. Our analysis of the scientific literature revealed that the task of improving the accuracy of text differentiation remains unsolved. To solve the problem, it is necessary to carry out author attribution at the phonological level, to apply the combination of statistical methods that is most efficient for obtaining reliable results, and to determine the degree of action of the factors related to style, substyle, and the author's manner of presentation.
The aim of the present study is to improve the accuracy of differentiation of phonostatistic structures of styles in the English language based on the developed methods, models, and software tools for the author, substyle, and style attribution of a text.
To accomplish the aim, the following tasks have been set:
- to develop a mathematical basis for the system of differentiation of phonostatistic structures of functional styles in the English language using the theory of mathematical statistics, which would make it possible to improve the accuracy of output results;
- to construct models for the differentiation of phonostatistic structures of styles of the English language;
- to devise a structure of the system and the software that would be based on a modular principle, which would make it possible to rapidly modify the developed IT tools and to ensure that the software system is platform-independent.
The core of any software system is a mathematical basis that includes the developed methods. The constructed mathematical basis for the differentiation of phonostatistic structures of styles in the English language includes the following.
Step 1. Check the conformity of frequency of consonant phonemes to the law of normal distribution using the Pearson criterion and a simplified criterion by Romanovsky.
Step 2. Differentiate the texts using Student's criterion.
Step 3. Determine the groups of consonant phonemes for which substantial differences are established in the pairwise comparison of texts.
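Step 2 above can be illustrated with a minimal Java sketch of a two-sample Student's t statistic computed on the per-portion frequencies of one group of consonant phonemes in two texts. The paper does not state which variant of the criterion it uses, so the pooled-variance form is an assumption here, and the class and method names (StudentCriterionSketch, tStatistic) are hypothetical rather than taken from the authors' software.

```java
// Hypothetical helper: names and the pooled-variance variant are assumptions,
// not the authors' implementation.
final class StudentCriterionSketch {

    /** Arithmetic mean of the per-portion frequencies. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Unbiased sample variance (divides by n - 1). */
    static double variance(double[] xs) {
        double m = mean(xs);
        double ss = 0.0;
        for (double x : xs) {
            ss += (x - m) * (x - m);
        }
        return ss / (xs.length - 1);
    }

    /** Pooled-variance two-sample t statistic for samples a and b. */
    static double tStatistic(double[] a, double[] b) {
        int n1 = a.length;
        int n2 = b.length;
        double pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2);
        return (mean(a) - mean(b)) / Math.sqrt(pooled * (1.0 / n1 + 1.0 / n2));
    }
}
```

The absolute value of the statistic would then be compared with the critical value for n1 + n2 - 2 degrees of freedom at the chosen significance level, which is looked up separately and is not computed in this sketch.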
An algorithm of the ranking method includes the following steps:
Step 1. Determine the mean frequency of groups of consonant phonemes.
Step 2. Construct a descending series of mean frequencies for each group of phonemes.
Step 3. Determine significant differences between the pairwise compared texts based on the difference in ranking.
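One possible reading of the ranking steps above is sketched below in Java. It assumes that, within each text, the groups of consonant phonemes are ordered by descending mean frequency and that two texts are compared through the rank the same group receives in each of them. This interpretation, the class name RankingSketch, and the group labels used as map keys are assumptions made for illustration; the paper does not give the exact procedure.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a rank comparison; not the authors' code.
final class RankingSketch {

    /** Rank 1 goes to the group with the highest mean frequency, 2 to the next, and so on. */
    static Map<String, Integer> rankDescending(Map<String, Double> meanFrequencies) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(meanFrequencies.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        Map<String, Integer> ranks = new HashMap<>();
        for (int i = 0; i < entries.size(); i++) {
            ranks.put(entries.get(i).getKey(), i + 1);
        }
        return ranks;
    }

    /** Absolute difference between the ranks of one group in two texts. */
    static int rankDifference(String group, Map<String, Double> textA, Map<String, Double> textB) {
        return Math.abs(rankDescending(textA).get(group) - rankDescending(textB).get(group));
    }
}
```

Groups with equal mean frequencies receive neighbouring ranks in an unspecified order here; a real implementation would need an explicit rule for ties.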
An algorithm of the method for determining the distances between styles is implemented by the following steps:
Step 1. Differentiate the pairwise compared texts based on Student's criterion.
Step 2. Derive from the formula for Student's criterion a formula for determining the distances between styles
(l = izio.) [24, 25].
Step 3. Determine a large, medium, and insignificant distance between styles.
The method considered makes it possible to differentiate with greater accuracy the styles, substyles, and texts by different authors.
Step 1. Determine substantial differences in the pairwise comparison of texts based on Student's criterion: different styles, different substyles, different authors.
Step 2. Determine a significant, medium, or insignificant degree of action of the factors related to style, substyle, and the author's manner of presentation.
The method makes it possible to establish with a higher accuracy the affiliation of the text under study to a specific style, substyle, and to identify its author.
Based on the developed methods, we have built statistical models for the style, substyle, and author differentiation of texts by the ranking method. An algorithm of the specified models includes the following steps.
Step 1. Determine the mean frequency of groups of consonant phonemes for texts: of different styles, different substyles, by different authors, determine the highest and lowest indicators of values for the mean frequency, determine large, medium, and minor differences based on the proposed formula
$r_{x_1 - x_2} = r_{\max x_1} - r_{\min x_2}.$
The models developed make it possible to take into consideration, with greater accuracy, the position of a phoneme in a word and to perform the style, substyle, and author attribution of texts based on the ranking difference.
We have developed a statistical model for determining a general stylistic markedness of the examined text. An algorithm for constructing the model includes the following steps:
Step 1. Determine essential differences, based on Student's criterion, in the compared texts: different styles, different substyles, by different authors, on various subjects.
Step 2. Determine the mean value of the three obtained t-values of Student's criterion:
$\bar{t} = \dfrac{t_{f_1} + t_{f_2} + t_{f_3}}{3}.$
Step 3. Determine a large, medium, and insignificant stylistic markedness of the examined text.
The developed model is a combination of three models represented in papers [26, 27]. The model needs to be applied when texts belong to the same style and substyle but are by different authors and address different topics. The model makes it possible to identify the author of texts on various subjects with higher accuracy. Therefore, the developed methods, models, and algorithms make it possible to improve the accuracy of differentiation of the phonostatistic structures of styles.
The methods and models developed have been implemented in the Java programming language, in the system of differentiation of phonostatistic structures of styles in the English language.
The structure of the developed software is shown in Fig. 1. It is based on a modular principle, which allows each module to be customized and maintained individually and ensures high reliability of the system [28]; the software is also easy to upgrade.
The algorithm for differentiating English-language styles based on the mean frequencies of groups of consonant phonemes, which is implemented in the system, consists of the following basic steps:
The result is the determined values of the mean frequencies of groups of consonant phonemes.
Provided that the mean frequencies of groups of consonant phonemes comply with the normal distribution, computerized style differentiation is then performed on the mean frequencies of the groups using Student's criterion.
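The normality condition mentioned here can be sketched in Java as a Pearson chi-square statistic followed by Romanovsky's simplified ratio. The sketch assumes that the expected bin counts under the fitted normal distribution have already been computed elsewhere and that the usual threshold of 3 is applied to the Romanovsky ratio; the paper does not state the threshold or the binning it uses, and the class name NormalityCheckSketch is hypothetical.

```java
// Hypothetical sketch: expected counts under the fitted normal distribution
// are assumed to be supplied by the caller; the threshold 3 is a common
// convention, not a value taken from the paper.
final class NormalityCheckSketch {

    /** Pearson chi-square statistic: sum of (observed - expected)^2 / expected over the bins. */
    static double chiSquare(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("observed and expected must have the same length");
        }
        double chi2 = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double d = observed[i] - expected[i];
            chi2 += d * d / expected[i];
        }
        return chi2;
    }

    /** Romanovsky ratio |chi2 - k| / sqrt(2k), where k is the number of degrees of freedom. */
    static double romanovskyRatio(double chi2, int degreesOfFreedom) {
        return Math.abs(chi2 - degreesOfFreedom) / Math.sqrt(2.0 * degreesOfFreedom);
    }

    /** True when the discrepancy between observed and expected counts can be treated as random. */
    static boolean consistentWithExpected(double[] observed, double[] expected, int degreesOfFreedom) {
        return romanovskyRatio(chiSquare(observed, expected), degreesOfFreedom) < 3.0;
    }
}
```

Only when such a check passes for both compared texts would the comparison by Student's criterion be meaningful, which matches the order of operations described in the text.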
The algorithm of functioning of the system supports simultaneous work with two text files (Fig. 2). This includes opening two files, converting them into transcription, sampling the consonant phonemes, splitting the sample into portions, calculating the number of phonemes in each portion and in the sample, merging them into groups, and verifying the result by the Pearson criterion. This is done so that, provided the mean frequencies of groups of consonant phonemes comply with the normal distribution, the texts can be compared for the existence of phonetic differences.
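The portion-splitting and counting stage of this pipeline might look roughly like the following Java sketch. It assumes that the transcription has already been turned into a list of consonant phoneme symbols, that portions have a fixed size, and that a trailing incomplete portion is discarded; none of these details are stated in the paper, and the names PortionFrequencySketch and groupFrequencies are hypothetical.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the portion-splitting stage; not the authors' code.
final class PortionFrequencySketch {

    /**
     * Splits the phoneme sequence into consecutive portions of portionSize symbols
     * and returns, for each full portion, the share of phonemes belonging to the group.
     * A trailing incomplete portion is dropped (an assumption of this sketch).
     */
    static double[] groupFrequencies(List<String> phonemes, Set<String> group, int portionSize) {
        if (portionSize <= 0) {
            throw new IllegalArgumentException("portionSize must be positive");
        }
        int portions = phonemes.size() / portionSize;
        double[] result = new double[portions];
        for (int p = 0; p < portions; p++) {
            int count = 0;
            for (int i = p * portionSize; i < (p + 1) * portionSize; i++) {
                if (group.contains(phonemes.get(i))) {
                    count++;
                }
            }
            result[p] = (double) count / portionSize;
        }
        return result;
    }
}
```

The resulting array of per-portion frequencies is the kind of sample on which a normality check and a comparison by Student's criterion could then be run.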
Fig. 1. Structure of software system for the differentiation of phonostatistic structures of functional styles of the English language
In the process of software development we constructed the following basic classes: Main, Window, PanelFile, ExtFileFilter, PanelTranscription, DistributionOfPortion, DitributionOfGroup, CriterionPearson, CriterionStudent. The developed structure of classes enables choosing a text file, checking whether a given file has the .txt extension, converting the text into a transcription variant. Input samples are checked by the system for conformity with the normal distribution law and are differentiated based on the mean frequencies of groups of consonant phonemes.
Using the Java programming language ensures that the developed software is platform-independent.
The study material comprises texts written in the literary, conversational, newspaper, and scientific styles. Fig. 2 shows an example of the interface for adding new words to the Word.txt and Transcription.txt files.
We tested the system on texts written by different authors in the scientific style. In the "Pearson Criterion" tab, we verified the conformity of the texts to the law of normal distribution. It was established that the groups of labial, front-alveolar, mid-alveolar, post-alveolar, nasal, sonorous, slit, and closed phonemes comply with the law of normal distribution. By differentiating the phonostatistic structures of scientific-style texts by different authors using Student's criterion, we established significant differences between the texts for the groups of labial, front-alveolar, post-alveolar, nasal, slit, and closed phonemes. Random differences were found for the groups of mid-alveolar and sonorous phonemes. Thus, we have established phonostatistic parameters for the differentiation of texts by different authors.
Based on the research results obtained for the scientific, fiction, conversational, and newspaper styles, we determined substantial differences for the group of slit phonemes by the ranking method (the difference in rank indicators is 6). Fig. 3 shows the statistical model of style differentiation for the scientific and conversational styles, based on the ranking method, for the group of slit phonemes and an undefined position of the phoneme in a word.
Fig. 2. Example of the interface for adding new words to the Word.txt and Transcription.txt files
Fig. 3. Statistical model of style differentiation based on the ranking method: 6 — significant essential difference (6 units);
To identify the authorship of texts that address various subjects but belong to one style and substyle, it is appropriate to apply a statistical model that combines three component statistical models: determining the style affiliation, determining the substyle affiliation, and identifying the author of texts on various topics. This is the statistical model for determining the general stylistic markedness of the examined text (Fig. 4).
Fig. 4. Model: a) style differentiation for the case when a phoneme is at the beginning of a word, comparing texts of poems by Moore (PM) with the conversational (CS), newspaper (NS), and scientific (SS) styles; b) substyle differentiation for the case when a phoneme is at the end of a word, comparing texts of poems by Byron and Moore, fiction by Byron (FB), and drama by Shaw (DSh); c) author's differentiation for the case of an undefined position of the phoneme in a word, comparing texts of poetry by Byron and Moore; the belles-lettres style (BS)
The research results based on 5 out of 553 experiments (described earlier, in particular, in [22, 23, 26, 27]) showed that the developed methods, models, and tools make it possible to improve the efficiency of author attribution of a text. The phonological level selected for the study is more strictly organized than the other levels of a language. However, the phonological system is probabilistic in character, with a probability of error equal to 5 %. The developed software system could be applied to identify the authorship of a text in fiction, as well as in the legal, official, business, and scientific areas. Further study will address the development of a software system for the author attribution of a text for each of the groups of consonant phonemes in order to determine the group of phonemes for which author attribution would be most effective.
The research results obtained could be used for identifying the authorship of the examined text, as well as for verifying a text for plagiarism. Further research seems promising in terms of defining phonostatistic parameters, specifically, the style-differentiating power of groups of consonant phonemes whose mean frequencies are the criterion for the differentiation of an author's style.
References