Module Wshiml

module Wshiml: sig .. end
Word shingling for near-duplicate document detection.

type sketch 
A sketch of one document. A sketch is the set of minimal elements that result from xor-ing the hash of shingles with a list of random hashes.
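As an illustration of the description above, here is a minimal self-contained sketch of the min-hash idea: each shingle hash is xor-ed with each mask, and the minimum per mask is kept. It is not Wshiml's actual code. The names `masks`, `sketch` and `similarity`, the use of `Hashtbl.hash`, and the fixed mask values are all assumptions made for the example; a real implementation would presumably draw the masks at random.

```ocaml
(* Hypothetical fixed masks standing in for the "list of random hashes". *)
let masks = [0x1bd1e995; 0x27d4eb2f; 0x165667b1; 0x1b873593]

(* For each mask, keep the smallest value of (hash of shingle) xor mask.
   The result has one entry per mask and ignores shingle order. *)
let sketch (shingles : string list) : int list =
  let hashes = List.map Hashtbl.hash shingles in
  List.map
    (fun m -> List.fold_left (fun acc h -> min acc (h lxor m)) max_int hashes)
    masks

(* Estimated resemblance: the fraction of mask positions whose minima agree.
   Both sketches must have been built with the same masks. *)
let similarity (a : int list) (b : int list) : float =
  let agree =
    List.fold_left2 (fun n x y -> if x = y then n + 1 else n) 0 a b in
  float_of_int agree /. float_of_int (List.length a)
```

Two documents that share most of their shingles will tend to share most of their per-mask minima, which is why comparing sketches approximates comparing the full shingle sets.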
type sketchdocs = (sketch * int) list 
List of sketched documents – the second part of the tuple is a document ID.
type tokeniser = bytes -> bytes list 
The functions taking documents let you specify your own tokeniser. If a tokeniser is not provided, a very simple whitespace/punctuation tokeniser is used.
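A custom tokeniser only has to match the type `bytes -> bytes list`. The following is a hypothetical example (the name `alnum_tokeniser` and its splitting rules are assumptions, not part of Wshiml): it lowercases the document and splits on anything that is not an ASCII letter or digit.

```ocaml
(* Hypothetical tokeniser of type [bytes -> bytes list]:
   lowercases the input and splits on non-alphanumeric ASCII characters. *)
let alnum_tokeniser (doc : bytes) : bytes list =
  let is_word c = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') in
  let s = String.lowercase_ascii (Bytes.to_string doc) in
  let buf = Buffer.create 16 in
  let tokens = ref [] in
  (* Emit the token accumulated so far, if any. *)
  let flush () =
    if Buffer.length buf > 0 then begin
      tokens := Bytes.of_string (Buffer.contents buf) :: !tokens;
      Buffer.clear buf
    end in
  String.iter
    (fun c -> if is_word c then Buffer.add_char buf c else flush ())
    s;
  flush ();
  List.rev !tokens
```

Given the signatures in this module, such a function would be passed through the optional argument, for example `Wshiml.compare_docs ~tokenise:alnum_tokeniser a b`.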

The n in the functions below is the number of shingles to use; it defaults to 4.
val compare_docs : ?tokenise:tokeniser -> ?n:int -> bytes -> bytes -> float
Compares two documents "exactly", by taking the Jaccard coefficient of the shingles themselves.
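To make the "exact" comparison concrete, here is a minimal sketch of the Jaccard coefficient |A ∩ B| / |A ∪ B| computed over word shingles. It is an illustration under assumptions, not the library's implementation: the helpers `words`, `take`, `shingles` and `jaccard` are hypothetical, tokenisation is a plain split on spaces, each shingle is taken to be a run of `n` consecutive words, and two empty shingle sets are arbitrarily treated as identical.

```ocaml
module SS = Set.Make (String)

(* Split on spaces, dropping empty tokens. *)
let words s = List.filter (fun w -> w <> "") (String.split_on_char ' ' s)

(* First [k] elements of [l], or None if [l] is too short. *)
let rec take k l =
  match k, l with
  | 0, _ -> Some []
  | _, [] -> None
  | k, x :: tl -> Option.map (fun r -> x :: r) (take (k - 1) tl)

(* Set of all runs of [n] consecutive words, each joined into one string. *)
let shingles n ws =
  let rec go acc l =
    match l, take n l with
    | _ :: tl, Some sh -> go (SS.add (String.concat " " sh) acc) tl
    | _ -> acc in
  go SS.empty ws

(* Jaccard coefficient; two empty sets count as identical (an assumption). *)
let jaccard a b =
  let union = SS.cardinal (SS.union a b) in
  if union = 0 then 1.0
  else float_of_int (SS.cardinal (SS.inter a b)) /. float_of_int union
```

For instance, with 2-word shingles, "the quick brown fox jumps" and "the quick brown fox sleeps" share 3 of 5 distinct shingles, giving a coefficient of 0.6.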
val sketch_of_doc : ?tokenise:tokeniser -> ?n:int -> bytes -> sketch
val sketch_docs : ?tokenise:tokeniser ->
?n:int -> ?slurp_file:(bytes -> bytes) -> bytes list -> sketchdocs
slurp_file : If provided, called on each document before processing (lets you pass a list of filenames instead of a list of the full documents).
val supersketches : ?n:int -> sketchdocs -> sketchdocs
type scoreddocs = ((int * int) * int) list 
Maps each document pair (given as indices) to its score.
val score_sketches : ?threshold:float -> sketchdocs -> scoreddocs
type clusters = int list list 
List of clusters; documents referred to by their index in the original list.
val cluster_scores : ?ndocs:int -> scoreddocs -> clusters
ndocs : Number of documents. If not provided, uses the highest index in scoreddocs.
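The documentation does not say how clusters are derived from the scored pairs. One plausible reading, shown here purely as an assumption, is to treat every scored pair as an edge and return the connected components. The function `cluster` below is hypothetical; it ignores the score values and uses a small union-find over document indices, defaulting `ndocs` to one more than the highest index seen.

```ocaml
(* Hypothetical clustering: connected components over the scored pairs.
   Scores are ignored; every listed pair is treated as "similar enough". *)
let cluster ?ndocs (scored : ((int * int) * int) list) : int list list =
  let n =
    match ndocs with
    | Some n -> n
    | None ->
      1 + List.fold_left (fun m ((i, j), _) -> max m (max i j)) (-1) scored in
  let parent = Array.init n (fun i -> i) in
  (* Find the representative of [i], compressing the path on the way. *)
  let rec find i =
    if parent.(i) = i then i
    else begin
      let r = find parent.(i) in
      parent.(i) <- r;
      r
    end in
  (* Merge the two components; the smaller index becomes the root. *)
  List.iter
    (fun ((i, j), _) ->
       let ri = find i and rj = find j in
       if ri <> rj then parent.(max ri rj) <- min ri rj)
    scored;
  (* Collect members under their root, in ascending index order. *)
  let groups = Array.make n [] in
  for i = n - 1 downto 0 do
    let r = find i in
    groups.(r) <- i :: groups.(r)
  done;
  List.filter (fun g -> g <> []) (Array.to_list groups)
```

Passing `~ndocs` matters when trailing documents have no scored pair: under this reading they would otherwise be missing from the output instead of appearing as singleton clusters.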