Guardians of the Data: Searchable encryption for document management systems

Supervisor(s): Fabian Franzen
Status: finished
Topic: Others
Author: Luka Tomas
Submission: 2025-03-24
Type of Thesis: Bachelor's Thesis

Description

As digitization accelerates, efficiently managing and securely storing digital documents has become
increasingly critical. Document Management Systems (DMS) streamline organizational
workflows by providing indexing, metadata extraction, and full-text search capabilities. However,
most existing DMS solutions, such as the popular open-source project Paperless-NGX,
either neglect encryption entirely or rely on server-side encryption, which undermines data
confidentiality by storing encryption keys alongside encrypted documents. This fundamental
limitation necessitates a robust solution capable of ensuring both confidentiality and search
functionality simultaneously.

We address this challenge by introducing Miniwhoosh, a search library that supports
Searchable Symmetric Encryption (SSE) based on an inverted index. We detail the design,
implementation, and integration of Miniwhoosh into the Paperless-NGX project, emphasizing
the minimal disruption to the existing architecture. Furthermore, we provide a comprehensive
performance analysis comparing Miniwhoosh’s encrypted and unencrypted modes with the
original Whoosh implementation. The evaluation covers document ingestion performance,
search latency, and memory usage, demonstrating Miniwhoosh’s viability for practical
deployment. Our analysis shows that Miniwhoosh significantly improves indexing speed and
maintains comparable search responsiveness, but introduces memory overhead, primarily
due to encryption.

Overall, this thesis contributes a practical, efficient, and secure document management
solution that bridges the gap between robust confidentiality and usability, enabling
secure storage and search operations on untrusted servers.
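
To illustrate the general idea of SSE over an inverted index mentioned above, the following is a minimal Python sketch. It is not Miniwhoosh's actual design or API: the class name ToySSEIndex, its methods, and the key-derivation labels are hypothetical, and the XOR masking derived from HMAC-SHA256 is a toy stand-in for a real authenticated cipher. The sketch assumes the client holds the secret key, while the server only stores keyed-hash tokens and masked document identifiers.

```python
import hashlib
import hmac
import secrets


def _prf(key: bytes, data: bytes) -> bytes:
    """Pseudorandom function, instantiated here with HMAC-SHA256."""
    return hmac.new(key, data, hashlib.sha256).digest()


class ToySSEIndex:
    """Hypothetical toy SSE index; illustrative only, not production-grade."""

    def __init__(self, key: bytes) -> None:
        # Derive separate keys for search tokens and for masking postings.
        self._token_key = _prf(key, b"token")
        self._enc_key = _prf(key, b"enc")
        # What an untrusted server would store: token -> masked postings.
        self.server_index: dict[bytes, list[bytes]] = {}

    def _token(self, term: str) -> bytes:
        # The server sees only this keyed hash, never the plaintext term.
        return _prf(self._token_key, term.lower().encode("utf-8"))

    def _pad(self, token: bytes, nonce: bytes, length: int) -> bytes:
        # Toy one-block keystream (at most 32 bytes), fresh per posting.
        return _prf(self._enc_key, token + nonce)[:length]

    def add_document(self, doc_id: str, text: str) -> None:
        raw = doc_id.encode("utf-8")
        if len(raw) > 32:
            raise ValueError("toy scheme supports document IDs up to 32 bytes")
        for term in set(text.lower().split()):
            token = self._token(term)
            nonce = secrets.token_bytes(16)
            pad = self._pad(token, nonce, len(raw))
            masked = bytes(a ^ b for a, b in zip(raw, pad))
            self.server_index.setdefault(token, []).append(nonce + masked)

    def search(self, term: str) -> list[str]:
        token = self._token(term)
        results = []
        # The server would look up the token; the client unmasks the postings.
        for entry in self.server_index.get(token, []):
            nonce, masked = entry[:16], entry[16:]
            pad = self._pad(token, nonce, len(masked))
            results.append(bytes(a ^ b for a, b in zip(masked, pad)).decode("utf-8"))
        return sorted(results)


index = ToySSEIndex(secrets.token_bytes(32))
index.add_document("doc1", "invoice from bank")
index.add_document("doc2", "bank statement")
matches = index.search("Bank")
```

In this sketch the server can still observe which tokens are queried and how many postings each token has, the kind of leakage that SSE schemes typically accept in exchange for efficient search.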