SUNCODAC

Computer-mediated communication (CMC)

CMC refers to all forms of human communication online. A broad distinction can be made between written and spoken CMC categories, with or without video-support, which, in turn, may be classified according to a number of dimensions (Herring, 2007): private vs. public, synchronous vs. asynchronous or person-to-person vs. group communication, among others. CMC includes a wide spectrum of forms, including e-mail, instant messaging, blogs, discussion forums, chat rooms, social networking sites, comments on news sites, video-conferencing, etc.

CMC provides an interesting research area for many disciplines, including sociology, psychology and linguistics. Linguists, in particular, “investigate (...) how the technical and pragmatic conditions of the underlying technology affect the strategies of language production and understanding” (Beißwenger & Storrer, 2008, p. 293).

Although CMC constitutes a major form of language use nowadays, it remains relatively neglected in Corpus Linguistics (CL). CL has been tapping into the immense and easily accessible language data on the internet for the study of language but, despite the huge volume of collected data (Mark Davies’ corpora, for instance), these large corpora crawled from the Web are barely representative of the rich variety of CMC contexts mentioned above.

The collection of CMC data raises some major issues (Beißwenger & Storrer, 2008) related to the special characteristics of the context and the content of CMC texts. CMC data rich in layout properties such as typefaces, colour and font sizes or graphic elements such as emojis can be difficult, if not impossible, to represent in a corpus when data are captured in ASCII format by saving screen content. Difficulties are also posed by hyperlinks to web content (especially when the URL is not explicit in the screen data) or to other parts of text, for example, responses to a previous post in a forum. Collecting CMC data also raises ethical concerns as to the desirable anonymization of the eventual corpus contents, especially when obtaining informed consent in advance is not feasible or desirable because it might compromise the authenticity of the collected data.

CMC texts have distinct features that make them difficult to handle with existing NLP tools, which were originally developed for traditional text forms. The widespread use of non-standard grammar and lexis in some CMC contexts is likely to pose problems, for instance, for much state-of-the-art PoS-tagging software. Some significant moves are being made to overcome this deficit. For instance, the TEI (Text Encoding Initiative) currently hosts a special interest group working on the development of the TEI standards to allow for the representation of CMC data. The group is working on the design and implementation of proposals that may complement existing TEI guidelines to represent the structural specificities of many CMC genres.

A frequent complaint among specialists in CMC is that many existing corpora were compiled for the purposes of specific projects and are not publicly available. Concerns resulting from the unclear legal status of CMC data may explain the non-dissemination of these materials. Some initiatives are being undertaken to overcome this situation, such as CLARIN, a European infrastructure for sharing digital language resources for research purposes. CMC corpora in CLARIN are open-access and can be queried online, downloaded or both. Various forms of CMC are represented, including blogs, forums, Twitter, news comments and WhatsApp messages, among others. Languages include Slovenian, Dutch, German, French, Italian, Estonian, Lithuanian and Norwegian, but not English. Apart from its role in the dissemination of existing CMC corpora, CLARIN is conducting a series of initiatives to improve and homogenize existing methods and tools to represent and annotate CMC texts. Admirable as the CLARIN initiative is, it is clear that, currently, it only represents a small fraction of the variety of forms of computer-mediated communication existing today and only a small number of languages.

Why a corpus like SUNCODAC?

The Internet and new technologies have become a core component of our personal and professional lives. One area where this is especially true is higher education. University has changed. Roles have been redefined: teachers have become facilitators and, for students, the emphasis is now on the collaborative construction of knowledge, often online. New computer-mediated genres are evolving to carry out these activities, which demand new socio-pragmatic skills, including the ability to project an appropriate online self and to frame criticism in an appropriate manner. Learning how to interact online in academic contexts frequently starts at the undergraduate level, often in increasingly multicultural settings and in a language other than the students’ mother tongue. Online interaction through a second language clearly places extra constraints on participants that are worth exploring through samples of authentic communicative events. SUNCODAC seeks to open a new window into this reality.

SUNCODAC contains the Moodle-based discussions generated in an undergraduate translation course at the University of Santiago de Compostela over a 4-year period, from 2014 to 2017. In the first two years, the discussions were mostly held in Spanish, the mother tongue of most of the students (with English and Galician being used occasionally), while in the last two years English was set as the mandatory language, serving as a lingua franca for local students and exchange students of different nationalities. The context of the activity was a blended English-into-Spanish translation course, with the forum providing a natural complement to the face-to-face teaching. The course was offered as a second-year compulsory subject for English majors as well as for students from other degrees taking English as a minor. It was also very popular among exchange students, who made up about one third of the total number of participants. Every week, there was a class discussion around the translation of a set text, which the students had previously prepared as homework. This face-to-face discussion was followed by an online discussion of a different passage of the same text, a sort of follow-up to the in-class discussion, using the Moodle forum tool. The aim was to work together towards an optimal translation of the set text. Each individual thread (one per discussed excerpt) was led by one student who volunteered as a “moderator” and whose role was, first, to upload a draft translation and, second, to produce a final revised version after reading the comments and suggestions made by classmates. Each individual thread consisted of the following:

Lecturers’ instructions. A single opening post by the lecturers with the source text’s excerpt, the moderator’s name, basic instructions for participants in the activity and deadlines.
Moderator’s first translation. A single post containing the moderator’s suggested translation of the set excerpt.
Peer feedback. A long thread of posts, the core of the discussion, where the moderator’s classmates identify problems in the draft translation, make comments and suggestions for improvement and discuss the suitability of different translation solutions.
Moderator’s improved version and summary of discussion. Usually a single post with the moderator’s final improved version, plus a summary of the discussion’s highlights and a general response to classmates’ most significant or repeated comments and suggestions.
Lecturers’ concluding remarks, with assessment and appraisal of the activity. One single post where the lecturers summarize the most significant aspects of the discussion, in terms of their pedagogical value, and single out significant individual contributions (possible models). Important aspects that might have been overlooked in the discussion are also identified and suggestions for improvement are made, regarding both the moderator’s work and the rest of the students’ contributions to the task.

The SUNCODAC holdings

SUNCODAC contains 61 full forum discussions (threads) held over the period 2014-2017, totalling slightly under 600,000 words. Two instructors and 520 students contributed to the task: 73.8% were native Spanish/Galician speakers (including both instructors), 14.6% were Chinese speakers, 4.8% were native English speakers, and 6.7% made up a miscellaneous group of students with different language backgrounds and nationalities, principally French, Italian and German. The distribution of participating students, by sex and language background per year, is shown in Table 1:

Table 1 Distribution of participants in the SUNCODAC forums, by year, sex and language background

Year / Sex	Native speakers of Spanish/Galician	L1 English (non-native)	L1 Chinese (non-native)	L1 Other (non-native)	Total
2014
Female	59	3	15	5	82
Male	22	0	5	0	27
Total	81	3	20	5	109
2015
Female	71	6	26	7	110
Male	20	6	10	2	38
Total	91	12	36	9	148
2016
Female	73	3	4	9	89
Male	26	1	2	3	32
Total	99	4	6	12	121
2017
Female	92	5	11	9	117
Male	19	1	3	2	25
Total	111	6	14	11	142
Corpus total
Female	295	17	56	30	398
Male	87	8	20	7	122
Total	382	25	76	37	520
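
The marginal totals in Table 1 can be re-derived from the individual cells. The following is a minimal sketch in Python, with the cell counts transcribed by hand from the table; the variable and function names are illustrative and not part of the corpus tools.

```python
# Cell counts transcribed from Table 1.
# Column order: L1 Spanish/Galician, L1 English, L1 Chinese, L1 Other.
TABLE_1 = {
    2014: {"Female": (59, 3, 15, 5), "Male": (22, 0, 5, 0)},
    2015: {"Female": (71, 6, 26, 7), "Male": (20, 6, 10, 2)},
    2016: {"Female": (73, 3, 4, 9), "Male": (26, 1, 2, 3)},
    2017: {"Female": (92, 5, 11, 9), "Male": (19, 1, 3, 2)},
}


def column_totals(rows):
    """Sum several equally long rows column by column."""
    return tuple(sum(column) for column in zip(*rows))


# Per-year totals for each language background (both sexes combined).
year_totals = {
    year: column_totals(by_sex.values()) for year, by_sex in TABLE_1.items()
}

# Corpus-wide totals per language background, and the overall student count.
corpus_by_language = column_totals(year_totals.values())
corpus_total = sum(corpus_by_language)

print(year_totals[2014])    # (81, 3, 20, 5)
print(corpus_by_language)   # (382, 25, 76, 37)
print(corpus_total)         # 520
```

The printed values match the "Total" rows of the table, which cover students only.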

Special CMC features

Typographic features such as underlining, italics, bold, font sizes and colour were lost during the compilation process and are, therefore, not represented in the corpus. However, other equally significant CMC features in the forums were captured:

Emoticons and other punctuation expressing emotions. All emoticons in SUNCODAC are punctuation-based (e.g., :), ;), etc.), as the forum tool in Moodle did not allow for the inclusion of any visuals, including emojis. The corpus also reproduces exclamation marks used to express emotion (e.g. Congrats !!), including single, double and triple exclamation marks, as well as words written in capitals for emphasis and similar purposes (e.g. THANK YOU).
Typos and misspellings, which were reproduced verbatim. Both typos and wrong or non-standard spellings may be interesting research variables. Although they do not occur as frequently in asynchronous online communication as they do in more spontaneous synchronous CMC environments (e.g., social networks, chats, etc.), some of the participants seem to use typos and ad-hoc spelling purposefully to instill informality in their contributions and to express in-group solidarity.
Non-standard grammar and vocabulary items were left intact. SUNCODAC contains numerous examples of these, as many of the participants are non-native speakers of the language. Non-standard forms and errors are an important variable for research purposes in SUNCODAC.
Hyperlinks to websites cited in the text, most often to support the poster’s point. Whenever the URL was “hidden” in a clickable word, it was recovered and can be made visible by hovering the mouse over the word or expression. This recovered information may be key, for instance, to study the amount and form of evidentials used by the students in their discussions, students’ notions of authority or reliable sources and how this information is used to infuse credibility. An interesting question to ask, for instance, is whether these evidentials are used differently by more or less successful participants (e.g., more or less cited students).
Paragraph breaks and bulleted lists.
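
Because these features survive as plain characters, they can be located with simple pattern matching. The following is a rough sketch in Python; the regular expressions and the function name cmc_features are assumptions made for illustration, cover only a few common emoticon shapes, and would need tuning against the actual corpus data.

```python
import re

# Punctuation-based emoticons such as :) ;) :( :D :P, with an optional nose.
# Caveat: this simple pattern can over-match, e.g. ":P" inside "note:Paris".
EMOTICON = re.compile(r"[:;]-?[)(DPp]")
# Runs of one or more exclamation marks, e.g. "!", "!!", "!!!".
EXCLAMATION_RUN = re.compile(r"!+")
# Fully capitalized words of at least two letters, e.g. "THANK", "YOU".
CAPS_WORD = re.compile(r"\b[A-ZÁÉÍÓÚÑ]{2,}\b")


def cmc_features(text):
    """Collect emoticons, exclamation-run lengths and all-caps words."""
    return {
        "emoticons": EMOTICON.findall(text),
        "exclamation_runs": [len(run) for run in EXCLAMATION_RUN.findall(text)],
        "caps_words": CAPS_WORD.findall(text),
    }


# Invented sample post, built from the examples given in the list above.
sample = "Congrats !! THANK YOU for the draft :) I agree ;)"
print(cmc_features(sample))
```

For the invented sample, the sketch finds two emoticons, one double exclamation mark and two capitalized words; the single-letter pronoun "I" is deliberately excluded by the two-letter minimum.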

Annotation

As of today, the corpus is minimally annotated, but texts have been saved in XML format, so that future annotations may be made by the corpus authors or by individual users with specific research needs. The corpus has not been PoS-tagged. To date, the only annotation added is a set of tags marking the different sections in the feedback posts (the most important component), which makes it possible to browse or search exclusively in specific sections.
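
To illustrate how section-level tags allow searches to be restricted to one part of the feedback posts, the following is a minimal sketch in Python using the standard xml.etree.ElementTree module. The element names (thread, post, section), the attributes (type, author, name), the section labels and the sample content are all hypothetical; they do not reproduce the actual SUNCODAC markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative only,
# not the actual SUNCODAC schema. The post contents are invented.
SAMPLE = """
<thread>
  <post type="draft" author="14ABC">First version of the translation.</post>
  <post type="feedback" author="14XYZ">
    <section name="opening">Hi everyone!</section>
    <section name="suggestion">I would change the second sentence.</section>
  </post>
  <post type="feedback" author="14DEF">
    <section name="suggestion">Maybe a different verb tense here?</section>
  </post>
</thread>
""".strip()


def sections(root, name):
    """Return (author, text) pairs for the named section of feedback posts."""
    results = []
    for post in root.iter("post"):
        if post.get("type") != "feedback":
            continue  # only feedback posts carry section tags in this sketch
        for section in post.iter("section"):
            if section.get("name") == name:
                results.append((post.get("author"), (section.text or "").strip()))
    return results


root = ET.fromstring(SAMPLE)
for author, text in sections(root, "suggestion"):
    print(author, text)
```

The same pattern extends to any further annotation layer added later, provided the new tags are nested inside the post elements.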

Metadata

Decisions as to what metadata to include in a corpus will naturally depend on the compilers’ envisaged research questions but, unfortunately, may also be limited by the actual availability of different types of contextual information during the compilation process. Metadata are a major issue in corpus design and one that may constrain or increase a corpus’ research possibilities dramatically.

One of the envisaged goals in the compilation of SUNCODAC was to gain insights into processes of collective construction of a CMC genre: the vast majority of participants in the SUNCODAC forums were novices to the genre, which means they were learning how to participate in this type of academic forum as the course developed. Given the multicultural nature of the participants in the events, SUNCODAC was also expected to shed light on cross-cultural differences in the online communication practices of users with different lingua-cultural backgrounds. Another area of interest for the corpus compilers was the potential of the tool to identify and describe the existence of power relationships in the group: some individual participants or groups were expected to enjoy a certain prestige and serve as models for other members of the community. Finally, SUNCODAC may help reveal and describe possible differences in the linguistic behaviour of different categories of users, particularly gender differences, a major topic in CMC research. With these objectives in mind, the following set of metadata was coded for each individual post in the corpus:

post date and time;
author’s individual identity, sex and mother tongue;
post language;
source text, year and excerpt. The same text could be discussed in different years. Also, more than one excerpt of the same text could be discussed in a given year. Consequently, for each post, source text and thread year and specific excerpt were coded.
post type: source text, draft (moderator’s first translation), feedback (classmates’ comments and suggestions), summary (moderator’s improved translation plus comments), final comments (lecturer’s final assessment) and other (posts that do not fit into any of the previous categories).
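
The metadata set above can be pictured as one record per post. The following is a minimal sketch in Python; the class name PostMetadata, the field names, the value formats (for example, an integer for the excerpt) and the helper select are assumptions made for illustration, and the sample records are invented.

```python
from dataclasses import dataclass
from datetime import datetime

# Post types as listed in the corpus description.
POST_TYPES = {"source text", "draft", "feedback", "summary", "final comments", "other"}


@dataclass(frozen=True)
class PostMetadata:
    posted_at: datetime   # post date and time
    author_id: str        # anonymized participant code
    author_sex: str
    author_l1: str        # author's mother tongue
    post_language: str
    source_text: str
    year: int             # thread year
    excerpt: int          # excerpt of the source text (format assumed)
    post_type: str

    def __post_init__(self):
        if self.post_type not in POST_TYPES:
            raise ValueError(f"unknown post type: {self.post_type!r}")


def select(posts, **criteria):
    """Keep the posts whose metadata fields equal all the given values."""
    return [
        post for post in posts
        if all(getattr(post, field) == value for field, value in criteria.items())
    ]


# Invented records, for illustration only.
posts = [
    PostMetadata(datetime(2016, 3, 1, 10, 15), "16ABC", "F", "Spanish", "English", "Text A", 2016, 2, "draft"),
    PostMetadata(datetime(2016, 3, 2, 18, 40), "16XYZ", "M", "Chinese", "English", "Text A", 2016, 2, "feedback"),
    PostMetadata(datetime(2016, 3, 3, 9, 5), "16DEF", "F", "Spanish", "English", "Text A", 2016, 2, "feedback"),
]
feedback_l1_spanish = select(posts, post_type="feedback", author_l1="Spanish")
print([post.author_id for post in feedback_l1_spanish])  # ['16DEF']
```

Combining fields in this way corresponds to the kind of query the metadata is meant to support, such as comparing feedback posts across language backgrounds or sexes.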

Anonymization

Message authorship and personal references have been anonymized by replacing all references to students by name with 5-digit codes (the first two digits specify the year of participation; the next three are a combination of the student’s name and surname initials). This makes it possible to track individual participants in the forums, for example, if one wishes to browse or search all contributions by a given student or to focus attention on a group of students by selecting the desired user IDs. Since participant codes are also searchable character strings, it is also possible to keep track of references to a given student by other participants in their posts. Four types of address forms or personal references have been identified: name (e.g. “María José”), name+surname (e.g. “María José Pérez”), familiar variant or nickname (e.g. “Majo”) and initials (e.g. “M.J.”). Labels can be displayed by moving the mouse over the 5-digit participant code.
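
Since the codes are ordinary character strings, both their internal structure and their occurrences in running text can be handled with pattern matching. The following is a minimal sketch in Python. It assumes, as a reading of the description above, that a code consists of two year digits (14 to 17, for 2014-2017) followed by three alphanumeric characters; the sample codes and the names ParticipantCode, parse_code and mentions are hypothetical.

```python
import re
from typing import NamedTuple


class ParticipantCode(NamedTuple):
    year: int     # year of participation
    suffix: str   # three characters derived from name and surname initials


# Assumed format: year digits 14-17 followed by three alphanumeric characters.
CODE = re.compile(r"\b(1[4-7])([0-9A-Za-z]{3})\b")


def parse_code(code):
    """Split a participant code into year of participation and suffix."""
    match = CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a participant code: {code!r}")
    return ParticipantCode(2000 + int(match.group(1)), match.group(2))


def mentions(text):
    """List the participant codes referred to in a post, in order."""
    return [match.group(0) for match in CODE.finditer(text)]


# Invented codes and sentence, for illustration only.
print(parse_code("15MJP"))
print(mentions("I agree with 15MJP and 16ABC on this."))
```

Counting how often each code appears in other participants' posts would be one simple way to approximate how frequently a given student is cited.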

References

Beißwenger, M., & Storrer, A. (2008). Corpora of Computer-Mediated Communication. In A. Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An international handbook (pp. 292–309). Mouton de Gruyter.

Herring, S. C. (2007). A Faceted Classification Scheme for Computer-Mediated Discourse. Language@Internet, 4. https://nbn-resolving.de/urn:nbn:de:0009-7-7611