Community Resource Development
(CRI: CRD) Developing a Dark Web Collection and Infrastructure
for Computational and Social Sciences

 

Proposal Summary
PI: Dr. Hsinchun Chen, Artificial Intelligence Lab, University of Arizona

This project is intended to create a large archive, known as the Dark Web archive, and a research infrastructure for use by computer and information scientists as well as social scientists studying a wide range of computational problems and social and organizational phenomena. The archive will ultimately comprise testbed data containing thousands of multilingual websites, including millions of web pages and thousands of multimedia files, by U.S. domestic, Middle Eastern, and Latin American terrorist and extremist groups. A methodology and spidering (collection building) tools for time-based automated capture of terrorist groups’ websites and multimedia resources will be extended and enhanced from previous work through this project; this approach will then support monthly updates of the entire collection. In addition, the infrastructure will include tools supporting search, browse, and analysis capabilities.
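As a rough illustration of what time-based capture with monthly updates implies for bookkeeping, the sketch below labels each capture by month and reports months with no capture. It is not the project's spidering tool: the function names, the `YYYY-MM/host` key layout, and the `example.org` host are all hypothetical and used only for illustration.

```python
from datetime import date
from urllib.parse import urlparse


def snapshot_key(url: str, captured: date) -> str:
    """Build a hypothetical storage key, 'YYYY-MM/host', for one monthly capture."""
    host = urlparse(url).netloc.lower()
    return f"{captured:%Y-%m}/{host}"


def months_between(start: date, end: date) -> list[str]:
    """List every 'YYYY-MM' label from start to end, inclusive."""
    labels = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        labels.append(f"{year:04d}-{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return labels


def missing_months(captured: set[str], start: date, end: date) -> list[str]:
    """Return the monthly labels in the range that have no capture yet."""
    return [label for label in months_between(start, end) if label not in captured]


if __name__ == "__main__":
    # Placeholder host; not a site from the collection.
    print(snapshot_key("http://Example.org/index.html", date(2008, 6, 17)))
    print(missing_months({"2007-11", "2008-01"}, date(2007, 11, 1), date(2008, 2, 1)))
```

Keying captures by month rather than by exact timestamp is one simple way to support the comparison of current and historical data; a real collection would likely need finer-grained metadata.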

Researchers all over the world in a variety of disciplines are working to understand extremist groups, terrorism and terrorists: their effects on the world; how they communicate, organize and propagate themselves; how they are funded; and whom they connect with and why. Even as a prototype, the data in the Dark Web archive has been frequently requested by researchers working on these problems: not only social scientists and analysts struggling to understand the phenomenon of terrorism, but also computer and information scientists who work in knowledge discovery and dissemination (KDD), in data and text mining including entity extraction, and in many other fields. However, the prototype has three limitations: it has not been accessible or usable except by those few able and willing to build their own interfaces and tools; it is not readily updated, as the spidering process still needs enhancements; and it does not support analysis. At various stages throughout the project, input and evaluation will be sought from the community to be served, including computer and information science (CIS) researchers, social scientists, terrorism researchers and analysts, and others. Dissemination and distribution will also be an important component: existing conferences, workshops, and other venues will be leveraged to ensure that the availability of the Dark Web archive and infrastructure is widely known.

Intellectual merit: CIS researchers will be able to utilize the Dark Web archive for a wide range of exercises: to develop video and voice recognition technologies, advance information retrieval techniques whether mono- or multi-lingual, and improve methodologies in data and text mining as well as machine learning and artificial intelligence. Social scientists will be able to use the archive to study dynamic “dark” networks and the linkages or relationships between organizations, verify hypotheses about the use of the web by extremist/terrorist groups, and study the inter-relationship of culture, religion and politics. The Dark Web archive will support the comparison of current and historical data, minimize manual analysis by researchers in the social sciences, and enable the replication of experiments by researchers.

Broader impacts: In addition to supporting researchers in information, computer and social sciences, this project will also have some utility for the national security sector, including law enforcement and the intelligence community, although that is not its primary purpose. The letters of support accompanying this proposal amply demonstrate the breadth and depth of the proposed work, and its potential impact on researchers in both computational and social sciences.


Submitted to the National Science Foundation under proposal number 0709338. See the NSF project page.

   
ACTIVITIES UPDATE for 2007-2008
 

Specifications for all equipment and workstations have been finalized. Through several bidding processes, about 75% of the equipment has been received and tested; the remaining 25% has been ordered, with shipment expected in July. Installation on primary storage servers has begun and will be completed by the end of August 2008. The physical infrastructure has been refined since project inception, and some of the storage functions have been moved from remote servers directly to Dark Web team members’ desktop machines. This provides them with increased accessibility and flexibility in continuing to develop and manage the testbed. However, user access to the testbed will still be provided through secure servers. We have made detailed, updated plans for configuring the incoming equipment as efficiently as possible.

The equipment received to date includes the SAN, the machines for the spider farm, and the monitors for the desktop stations.

For the testbed, two additional spidering runs have been completed to enhance the existing prototype. These runs have also given us additional opportunities to refine our semi-automated spidering process. However, further improvements to the process will be the focus of Year 2 of the project.

Finally, in order to obtain additional input from potential users of the system, a special session on the Dark Web Archive was convened at the recent IEEE International Conference on Intelligence and Security Informatics in Taipei, Taiwan.  There, we met with an international group of students, faculty, government researchers and others, and solicited their feedback about their current work and how the Dark Web Archive might be useful to them.  Our findings are briefly summarized below.

   
FINDINGS UPDATE for 2007-2008
 

During this first phase of the project, we have continued to publicize the Dark Web archive and to solicit feature requests and needs from potential users. The special session on the Dark Web Archive at the IEEE International Conference on Intelligence and Security Informatics in Taipei, Taiwan, described above, was the main venue for gathering this input.

Of the 35 participants present at that session, 17 returned the surveys we used to solicit additional feedback.  Returned surveys were received from:
            4  Students
            6  Faculty members (from various universities)
            4  Government members
            2  Researchers (for private or public organizations)
            1  Law enforcement member
           
Student, faculty and researcher participants were from a variety of disciplines including computer science, information science, terrorism research and other social sciences.
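
The survey figures above can be checked with a short tally. This is only a worked-arithmetic sketch; the dictionary keys simply restate the categories listed above, and the variable and function names are illustrative.

```python
# Returned surveys by respondent category, as reported above.
returned = {
    "Students": 4,
    "Faculty members": 6,
    "Government members": 4,
    "Researchers": 2,
    "Law enforcement member": 1,
}
participants = 35  # session attendance reported above


def category_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's fraction of all returned surveys."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_returned = sum(returned.values())
response_rate = total_returned / participants
shares = category_shares(returned)

if __name__ == "__main__":
    print(f"Returned: {total_returned} of {participants} ({response_rate:.1%})")
    for name, share in shares.items():
        print(f"{name:<24}{share:.1%}")
```

The category counts sum to the 17 returned surveys, which is just under half of the 35 participants.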

Respondents expressed interest in using the Dark Web archive and infrastructure for:

  • Botnet, computer forensics, cybercrime, and cyber security research and crime investigation
  • Data mining and text mining for sentiment/violence analysis, social networking and contact analysis as well as computational modeling of criminal behavior
  • Network security including attack patterns and hacker tracing

We will continue to reach out more widely as the Dark Web archive begins to take shape. User feedback will be critical to the success and utility of the infrastructure.
 
PUBLICATIONS
 
  • Chen, H., and the Dark Web Team (2008). "IEDs in the Dark Web: Genre Classification of Improvised Explosive Device Web Pages," IEEE International Intelligence and Security Informatics Conference (Taipei, Taiwan, July 17-20, 2008). Springer Lecture Notes in Computer Science.
  • Chen, H., and the Dark Web Team (2008). "Discovery of Improvised Explosive Device Content in the Dark Web," IEEE International Intelligence and Security Informatics Conference (Taipei, Taiwan, July 17-20, 2008). Springer Lecture Notes in Computer Science.
  • Chen, H., and the Dark Web Team (2008). "Sentiment and Affect Analysis of Dark Web Forums: Measuring Radicalization on the Internet," IEEE International Intelligence and Security Informatics Conference (Taipei, Taiwan, July 17-20, 2008). Springer Lecture Notes in Computer Science.
 

 
 


 

