 |
Fingerprint-based identification is the oldest
biometric technique successfully used in conventional
crime investigation. The unique, immutable
patterns of a fingerprint can help a crime
investigator infer the identities of suspects.
However, circumstances have changed since
the emergence and rapid proliferation of cybercrime.
Generally, cybercrime includes Internet fraud,
computer hacking/network intrusion, cyber
piracy, spreading of malicious code, and so
on. Cyber criminals post online messages over
various Web-based channels to distribute illegal
materials, including pirated software, child
pornography materials, and stolen property.
Moreover, international criminals and terrorists
such as Osama bin Laden use
online messages as one of their major communication
media. Since people are not usually required
to provide their real identity in cyberspace,
the anonymity makes identity tracing a critical
problem in cybercrime investigation. This
problem is further complicated by the sheer
number of cyber users and activities.
Fortunately, there is another type of print, which we
call “writeprint,” hidden in people’s
writings. Similar to fingerprints, writeprint
is composed of multiple features, such as
vocabulary richness, length of sentence, use
of function words, layout of paragraphs, and
keywords. These writeprint features can represent
an author’s writing style, which is
usually consistent across his or her writings.
They can therefore serve as the basis of authorship
attribution and facilitate identity tracing
in cybercrime investigation.
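
As a rough illustration of the kinds of features listed above, the following sketch computes a few of them for a single message. It is a minimal example under stated assumptions, not the feature set used in this work: the function name `writeprint_features`, the tiny `FUNCTION_WORDS` list, and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny function-word list (an assumption for
# illustration; a real study would use a much larger list).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "is", "that")


def writeprint_features(text):
    """Return a dict of simple writing-style features for one message."""
    words = re.findall(r"[a-z']+", text.lower())
    # Split on sentence-ending punctuation and on blank lines.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    counts = Counter(words)
    n_words = len(words) or 1  # avoid division by zero on empty input
    features = {
        # Lexical: share of distinct words (a crude vocabulary-richness measure).
        "vocab_richness": len(counts) / n_words,
        # Syntactic-ish: average number of words per sentence.
        "avg_sentence_len": len(words) / (len(sentences) or 1),
        # Structural: number of paragraphs in the message.
        "n_paragraphs": len(paragraphs),
    }
    # Relative frequency of each function word.
    for w in FUNCTION_WORDS:
        features["fw_" + w] = counts[w] / n_words
    return features


sample = "The cat sat on the mat. The dog barked!\n\nIt was loud."
feats = writeprint_features(sample)
```

Each message becomes a fixed-length set of numbers, which is the form a feature-based classifier needs as input.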
We developed a framework for authorship identification
of online messages to address the identity
tracing problem. In this framework, four types
of writing style features (lexical, syntactic,
structural, and content-specific features)
are extracted and inductive learning algorithms
are used to build feature-based classification
models to identify authorship of online messages.
To examine this framework, we conducted experiments
on English and Chinese online newsgroup messages.
We compared the discriminating capability
of the four types of features and also the
predictive power of three classification techniques:
decision trees, back-propagation neural
networks, and support vector machines. The
experimental results showed that the proposed
approach was able to identify authors of online
messages with satisfactory accuracy. All four
types of message features contributed to discriminating
authors of online messages. Support vector
machines outperformed the other two classification
techniques in our experiments. The high performance
we achieved for the Chinese dataset showed
the potential of applying this approach in
a multilingual context. Our proposed framework
and techniques are promising for automatic
cyber criminal identity tracing.
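
To make the idea of feature-based authorship classification concrete, here is a minimal sketch that assigns a message to the author whose average feature vector is closest. Note that this nearest-centroid rule is a simple stand-in chosen for brevity; it is not one of the three techniques compared in this work (decision trees, back-propagation neural networks, support vector machines). The function names, author labels, and feature values are made up for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def train(labeled_vectors):
    """Build a model: one centroid per author.

    labeled_vectors maps an author label to a list of feature vectors.
    """
    return {author: centroid(vs) for author, vs in labeled_vectors.items()}


def predict(model, vector):
    """Return the author whose centroid is nearest (Euclidean distance)."""
    return min(model, key=lambda author: math.dist(model[author], vector))


# Made-up feature vectors: [vocabulary richness, average sentence length].
training = {
    "author_a": [[0.80, 8.0], [0.75, 9.0]],
    "author_b": [[0.50, 20.0], [0.55, 22.0]],
}
model = train(training)
guess = predict(model, [0.78, 10.0])
```

In practice, features on different scales would usually be normalized before computing distances, since otherwise a large-valued feature such as sentence length dominates the comparison.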
|