BACKGROUND
Companies accumulate millions of text documents of many kinds: emails, presentations, reports, technical requirements, work instructions, etc.
Problem
These documents are disorganized, which makes searching for and finding precise information time-consuming.
Benefit
Grouping the files into categories automatically can save users and companies a lot of time (knowledge management).
METHODOLOGY & RESULTS
Architecture: On-premises development using local indices based on Lucene.
Development language: Java
ML techniques: Unsupervised learning algorithms, fuzzy logic and NLP techniques.
Results: Depending on the complexity of the texts, their vocabulary, and the number of files, the system groups documents with an accuracy of 80% to 90%.
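Example: The summary above does not name the specific algorithms, so the following is only a minimal sketch of one common unsupervised approach, not the project's actual implementation. It assumes TF-IDF term weighting, cosine similarity, and a greedy single-pass ("leader") clustering step. The class name DocumentGrouper, its methods, and the similarity threshold are hypothetical, and the sketch uses plain Java collections in place of a Lucene index and leaves out fuzzy logic entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: TF-IDF vectors plus greedy single-pass (leader) clustering. */
final class DocumentGrouper {

    /** Lowercases the text and splits it on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Builds one sparse TF-IDF vector (term -> weight) per document. */
    static List<Map<String, Double>> tfidf(List<String> docs) {
        List<Map<String, Double>> vectors = new ArrayList<>();
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            Map<String, Double> tf = new HashMap<>();
            for (String term : tokenize(doc)) tf.merge(term, 1.0, Double::sum);
            for (String term : tf.keySet()) docFreq.merge(term, 1, Integer::sum);
            vectors.add(tf);
        }
        int n = docs.size();
        for (Map<String, Double> tf : vectors) {
            // Smoothed IDF (an assumption of this sketch): rare terms weigh more,
            // and a term found in every document still keeps weight 1.
            tf.replaceAll((term, f) ->
                    f * (Math.log((1.0 + n) / (1.0 + docFreq.get(term))) + 1.0));
        }
        return vectors;
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) sum += w * w;
        return Math.sqrt(sum);
    }

    /** Cosine similarity of two sparse vectors; 0 if either vector is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        double norms = norm(a) * norm(b);
        return norms == 0.0 ? 0.0 : dot / norms;
    }

    /**
     * Assigns each document to the most similar existing cluster, compared
     * against that cluster's first document (its "leader"). A document whose
     * best similarity is below the threshold starts a new cluster.
     * Returns one cluster label per document.
     */
    static int[] cluster(List<String> docs, double threshold) {
        List<Map<String, Double>> vectors = tfidf(docs);
        List<Map<String, Double>> leaders = new ArrayList<>();
        int[] labels = new int[docs.size()];
        for (int i = 0; i < vectors.size(); i++) {
            int best = -1;
            double bestSim = threshold;
            for (int c = 0; c < leaders.size(); c++) {
                double sim = cosine(vectors.get(i), leaders.get(c));
                if (sim >= bestSim) {
                    best = c;
                    bestSim = sim;
                }
            }
            if (best < 0) {
                best = leaders.size();
                leaders.add(vectors.get(i));
            }
            labels[i] = best;
        }
        return labels;
    }
}
```

Calling DocumentGrouper.cluster(docs, 0.3) on two short sales-report texts followed by two pump-instruction texts should place each pair in its own group. The threshold is a tuning parameter chosen here for illustration; a real system would calibrate it on its own corpus and would likely read term statistics from the index instead of recomputing them in memory.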
