Clustering documents based on statistical proximity of terms

Amons O.A., Janov Y.O., Bespaly I.O.

This paper describes an approach to clustering document collections when the number of clusters is not known in advance. It improves a method for computing the similarity matrix, which is based on the statistics of key-term occurrence in documents. A modified competitive similarity function is used to assess clustering quality and to find the limiting values of the algorithm's parameters. The approach is implemented as an application for the SmartBase application server. Implementation details and results are presented. A collection of Russian-language texts is used.
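
As an illustration of the general idea only, the following minimal sketch builds a document similarity matrix from key-term occurrence counts and groups documents without fixing the number of clusters beforehand. The abstract does not give the authors' formulas, so everything here is an assumption: cosine similarity stands in for the paper's similarity measure, thresholded connected components stand in for its clustering step, and the function names (`term_vector`, `similarity_matrix`, `threshold_clusters`) and the sample data are hypothetical.

```python
import math
from collections import Counter


def term_vector(text, key_terms):
    """Count how often each key term occurs in a whitespace-tokenized document."""
    counts = Counter(text.lower().split())
    return [counts[t] for t in key_terms]


def cosine(u, v):
    """Cosine similarity of two count vectors (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(docs, key_terms):
    """Pairwise similarity of documents from key-term counts (assumed measure)."""
    vectors = [term_vector(d, key_terms) for d in docs]
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def threshold_clusters(sim, threshold):
    """Label connected components of the graph linking documents whose
    similarity is at least `threshold`; the number of clusters is not
    given in advance but follows from the threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Hypothetical toy data, not taken from the paper.
docs = ["cat dog cat", "dog cat", "stock market stock", "market price stock"]
key_terms = ["cat", "dog", "stock", "market", "price"]
labels = threshold_clusters(similarity_matrix(docs, key_terms), 0.5)
print(labels)  # [0, 0, 1, 1]
```

In this sketch the threshold plays the role of a limiting value: raising it splits the collection into more, tighter clusters, and lowering it merges them.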
