searc - Enron Email Dataset Search Engine

A distributed system for cleaning, indexing, and searching the Enron email dataset.

Introduction

SEARC is a high-performance search system for the Enron email dataset, a collection of approximately 1.7 GB of email data from the Enron Corporation scandal. The project provides a scalable architecture to clean, index, and search this large text corpus.

How to use

Copy the unzipped data into the ./data folder so that each person's inbox is a direct child folder (e.g. data/allen-p).
Run the DataPartitioner application to separate the data into folders. The final structure should look like data/A/allen-p.
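
To make the expected layout concrete, here is a minimal Python sketch of the kind of move the partitioning step performs. It is not the DataPartitioner source: the function name `partition_mailboxes` is hypothetical, and the rule that the bucket folder is the upper-cased first letter of the inbox name is an assumption inferred from the data/A/allen-p example above.

```python
import shutil
from pathlib import Path


def partition_mailboxes(data_dir: Path) -> dict[str, list[str]]:
    """Move each inbox folder into a bucket folder named after its first letter.

    Hypothetical helper: the bucketing rule (upper-cased first character) is an
    assumption based on the data/A/allen-p example, not the project's own code.
    """
    moved: dict[str, list[str]] = {}
    # sorted() materializes the listing up front, so bucket folders created
    # during the loop are never visited.
    for mailbox in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        # Assumption: single-character folders are buckets from an earlier run.
        if len(mailbox.name) == 1:
            continue
        bucket = data_dir / mailbox.name[0].upper()
        bucket.mkdir(exist_ok=True)
        shutil.move(str(mailbox), str(bucket / mailbox.name))
        moved.setdefault(bucket.name, []).append(mailbox.name)
    return moved
```

Calling `partition_mailboxes(Path("data"))` on a folder containing data/allen-p would leave it at data/A/allen-p. Running it a second time moves nothing, because only single-letter bucket folders remain at the top level.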

