Authorship Analysis
Stylometric Online Authorship Identification: An Exploratory Study

 

Online communication media such as email, web sites, newsgroups, online forums, and chat rooms have become ubiquitous in everyday life. Unfortunately, online channels are also being misused to distribute unsolicited and inappropriate information (e.g., extremist propaganda, online pornography, online gambling). The anonymous nature of these channels makes them an ideal means of communication for criminal groups and extremist organizations. Additionally, the evolution of the Internet into a major international communication medium has added a multilingual dimension.

Authorship analysis has been used to analyze long, precise English texts such as the plays of Shakespeare (authorship identification) or students' class papers (plagiarism detection). Few past studies have addressed the multilingual issues of online communications. The language-specific stylistic characteristics and the informal nature of online communications present unique research challenges. To address these challenges, we aim to develop a comprehensive framework and associated text mining techniques for multilingual online stylometric feature extraction and authorship classification.

We plan to focus this exploratory study on two languages, English and Arabic. The linguistic differences between these two languages will allow us to evaluate common stylistic representations and explore other language-specific problems. We plan to develop comprehensive English and Arabic lexical, syntactic, structural, and content-based features that are suited for identifying online writing styles. We propose to evaluate these features using several large-scale public extremist forums (in English and Arabic) collected from the Web.

We also plan to develop a scalable feature reduction technique, based on principal component analysis, for authorship classification. Previous authorship analysis research was only able to analyze a limited number of authors (typically 5-20). We aim to develop scalable online authorship analysis techniques that can be used to analyze hundreds to thousands of anonymous authors (a common scenario for web communications). Feature (subset) selection techniques will be developed to help reduce the high dimensionality of online writing features. Lab experiments will be conducted to verify the classification accuracy and scalability (speed and efficiency) of our approach.
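To make the notion of lexical, syntactic, and structural features more concrete, the following is a minimal sketch of how a single online message might be turned into a small feature vector. It is an illustration only, not the project's actual feature set: the function name `stylometric_features`, the tiny `FUNCTION_WORDS` list, and the particular features chosen are all assumptions made for this example, and the function words shown are English only.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (English only).
# A real study would use a much larger, language-specific list.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "that", "is", "it")
PUNCTUATION = ",.!?;:"


def stylometric_features(message):
    """Return a small, illustrative feature dictionary for one message."""
    # Runs of letters only (no digits or underscores); Unicode-aware in Python 3.
    words = re.findall(r"[^\W\d_]+", message.lower())
    n_words = len(words) or 1    # avoid division by zero on empty input
    n_chars = len(message) or 1
    counts = Counter(words)

    features = {
        # Lexical features: word- and character-level statistics.
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_richness": len(counts) / n_words,  # type/token ratio
        "uppercase_ratio": sum(c.isupper() for c in message) / n_chars,
        "digit_ratio": sum(c.isdigit() for c in message) / n_chars,
        # Structural feature: how the message is laid out.
        "line_count": float(len(message.splitlines())),
    }
    # Syntactic features: function-word and punctuation usage rates.
    for fw in FUNCTION_WORDS:
        features["fw_" + fw] = counts[fw] / n_words
    for p in PUNCTUATION:
        features["punct_" + p] = message.count(p) / n_chars
    return features


if __name__ == "__main__":
    example = "The cat sat.\nThe dog ran!"
    for name, value in sorted(stylometric_features(example).items()):
        print(name, round(value, 3))
```

Each message becomes a fixed-length numeric vector, so vectors from many messages can be stacked into a matrix and passed to a dimensionality reduction step (such as principal component analysis) and then to a classifier. Content-based features and Arabic-specific features are omitted from this sketch.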

We believe that our combination of comprehensive multilingual online stylistic features and scalable feature classification techniques, although high-risk, can provide a potentially high-payoff solution to the challenging problem of multilingual online stylometric authorship identification. Online Arabic authorship analysis in particular is extremely difficult and high-risk. Upon successful completion of this SGER project, we anticipate a strong foundation for our future “cyber trust” research. The findings can also provide important insights to several computational and social sciences communities.

The primary intellectual contribution of our research is threefold: (a) developing and examining new text mining techniques that may be suitable for identity tracing in cyberspace; (b) creating new representations of people’s identities using online "Writeprints" (i.e., representations of people’s key online writing style features); and (c) evaluating the effectiveness of different multilingual stylistic features and classification techniques for improving identification scalability and robustness.

The broader impact of this research includes: (a) creating a new representation of people’s identities for classification of cyber criminals and potential extremists in online communities; (b) improving intelligence and law enforcement agencies’ abilities to detect, prevent, and respond to cyber crimes and terrorist events via the Internet; and (c) providing a large-scale research corpus and feature extraction resources for information scientists, political scientists, and terrorism researchers.

AI Lab Project Lead: Ahmed Abbasi


Funded by the National Science Foundation under award number 0646942, "SGER: Multilingual Online Stylometric Authorship Identification: An Exploratory Study"

   
 
 
