module Wshiml: sig .. end
Word shingling for near-duplicate document detection.
type sketch
A sketch of one document. A sketch is the set of minimal elements
that result from xor-ing the hash of shingles with a list of
random hashes.
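The description above can be illustrated with a minimal sketch. It assumes that one minimum is kept per random hash, that shingles are already available as strings, and that Hashtbl.hash is an acceptable shingle hash. The names make_masks, sketch_of_shingles and sketch_similarity are hypothetical and are not part of Wshiml; the module's actual representation of sketch is abstract.

```ocaml
(* Illustrative only: not the Wshiml implementation. *)

(* Random masks; a fixed seed (an assumption) means every document is
   sketched with the same list of masks. *)
let make_masks ?(seed = 42) (k : int) : int list =
  let st = Random.State.make [| seed |] in
  List.init k (fun _ -> Random.State.bits st)

(* For each mask, keep the smallest value of (hash shingle) lxor mask. *)
let sketch_of_shingles (masks : int list) (shingles : string list) : int list =
  let hashes = List.map Hashtbl.hash shingles in
  List.map
    (fun mask ->
      List.fold_left (fun m h -> min m (h lxor mask)) max_int hashes)
    masks

(* Assumed comparison: fraction of positions at which two sketches agree.
   Both sketches must be non-empty and built from the same masks. *)
let sketch_similarity (a : int list) (b : int list) : float =
  let same =
    List.fold_left2 (fun acc x y -> if x = y then acc + 1 else acc) 0 a b
  in
  float_of_int same /. float_of_int (List.length a)
```

Because each entry is a minimum over the whole shingle set, the order in which shingles are supplied does not affect the result.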
type sketchdocs = (sketch * int) list
List of sketched documents; the second component of each pair is a
document ID.
type tokeniser = bytes -> bytes list
The functions taking documents let you specify your own tokeniser.
If a tokeniser is not provided, a very simple
whitespace/punctuation tokeniser is used.
The n in the functions below is the number of shingles to use; it
defaults to 4.
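As a rough illustration, a whitespace/punctuation tokeniser and a shingling step might look like the following. This is a sketch under assumptions, not the module's code: it uses string rather than bytes for brevity, treats only ASCII letters and digits as word characters, and reads n as the number of words per shingle (the usual meaning in w-shingling; the text above describes it only as "the number of shingles to use"). The names simple_tokenise and shingles are hypothetical.

```ocaml
(* Illustrative only: not the Wshiml implementation. *)

(* Split on any character that is not an ASCII letter or digit. *)
let simple_tokenise (s : string) : string list =
  let is_word c =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
  in
  let buf = Buffer.create 16 in
  let tokens = ref [] in
  (* Emit the pending token, if any. *)
  let flush () =
    if Buffer.length buf > 0 then begin
      tokens := Buffer.contents buf :: !tokens;
      Buffer.clear buf
    end
  in
  String.iter (fun c -> if is_word c then Buffer.add_char buf c else flush ()) s;
  flush ();
  List.rev !tokens

(* All runs of n consecutive tokens, in document order.
   Assumption: n is the shingle width in tokens. *)
let shingles ?(n = 4) (tokens : string list) : string list list =
  let arr = Array.of_list tokens in
  let len = Array.length arr in
  if len < n then []
  else List.init (len - n + 1) (fun i -> Array.to_list (Array.sub arr i n))
```

A document with fewer than n tokens yields no shingles in this sketch; how the module itself handles that case is not stated.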
val compare_docs : ?tokenise:tokeniser -> ?n:int -> bytes -> bytes -> float
Compares two documents "exactly", by taking the Jaccard
coefficient of the shingles themselves.
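The Jaccard coefficient of two shingle sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch of that computation follows, assuming documents are already tokenised into words and that each shingle is n consecutive words joined by a space. The names SS, shingle_set and jaccard are hypothetical, and returning 1.0 for two empty sets is a choice made here, not documented behaviour of compare_docs.

```ocaml
(* Illustrative only: not the Wshiml implementation. *)
module SS = Set.Make (String)

(* Set of distinct shingles of n consecutive words, each joined into one key. *)
let shingle_set ?(n = 4) (words : string list) : SS.t =
  let arr = Array.of_list words in
  let len = Array.length arr in
  let rec go i acc =
    if i + n > len then acc
    else
      let key = String.concat " " (Array.to_list (Array.sub arr i n)) in
      go (i + 1) (SS.add key acc)
  in
  go 0 SS.empty

(* |A inter B| / |A union B|; defined as 1.0 here when both sets are empty. *)
let jaccard (a : SS.t) (b : SS.t) : float =
  let union = SS.cardinal (SS.union a b) in
  if union = 0 then 1.0
  else float_of_int (SS.cardinal (SS.inter a b)) /. float_of_int union
```

For example, with n = 4, "a rose is a rose is a rose" has 3 distinct shingles and "a rose is a flower which is a rose" has 6; they share only "a rose is a", so the coefficient is 1/8 = 0.125.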
val sketch_of_doc : ?tokenise:tokeniser -> ?n:int -> bytes -> sketch
val sketch_docs : ?tokenise:tokeniser ->
?n:int -> ?slurp_file:(bytes -> bytes) -> bytes list -> sketchdocs
slurp_file: If provided, called on each document before processing (lets you pass a list of filenames instead of a list of the full documents).
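A function of the shape expected by slurp_file could be written as below. This is only a sketch: it assumes each list element is a filename held in a bytes value and that the whole file fits in memory. The function name is a placeholder.

```ocaml
(* Illustrative only: read the whole file named by path and return its contents. *)
let slurp_file (path : bytes) : bytes =
  let ic = open_in_bin (Bytes.to_string path) in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () ->
      let n = in_channel_length ic in
      let buf = Bytes.create n in
      (* really_input raises End_of_file if fewer than n bytes are available. *)
      really_input ic buf 0 n;
      buf)
```

Such a function would be passed as ~slurp_file to sketch_docs along with a list of filenames in place of document contents.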
val supersketches : ?n:int -> sketchdocs -> sketchdocs
type scoreddocs = ((int * int) * int) list
Maps each document pair (as indices) to its score.
val score_sketches : ?threshold:float -> sketchdocs -> scoreddocs
type clusters = int list list
List of clusters; documents referred to by their index in the
original list.
val cluster_scores : ?ndocs:int -> scoreddocs -> clusters
ndocs: Number of documents. If not provided, uses the highest index in scoreddocs.
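One plausible way to turn scored pairs into clusters is to take connected components with a union-find structure. The sketch below makes several assumptions that the text above does not confirm: document indices are 0-based, every pair present in the list counts as a link regardless of its score, and documents with no links are returned as singleton clusters. The name cluster_pairs is hypothetical.

```ocaml
(* Illustrative only: not the Wshiml implementation. *)
let cluster_pairs ?ndocs (scored : ((int * int) * int) list) : int list list =
  let max_index =
    List.fold_left (fun m ((i, j), _) -> max m (max i j)) (-1) scored
  in
  (* Assumption: 0-based indices, so the highest index implies max_index + 1
     documents; a smaller ndocs is widened to avoid out-of-bounds access. *)
  let n =
    match ndocs with
    | Some n -> max n (max_index + 1)
    | None -> max_index + 1
  in
  let parent = Array.init n (fun i -> i) in
  (* Find the representative of i, compressing the path as we go. *)
  let rec find i =
    if parent.(i) = i then i
    else begin
      let r = find parent.(i) in
      parent.(i) <- r;
      r
    end
  in
  (* Link the two components of every scored pair; the smaller root wins. *)
  List.iter
    (fun ((i, j), _) ->
      let ri = find i and rj = find j in
      if ri <> rj then parent.(max ri rj) <- min ri rj)
    scored;
  (* Bucket each document under its representative, in ascending order. *)
  let buckets = Array.make n [] in
  for i = n - 1 downto 0 do
    let r = find i in
    buckets.(r) <- i :: buckets.(r)
  done;
  List.filter (fun c -> c <> []) (Array.to_list buckets)
```

With pairs (0,1), (1,2) and (4,5) and no ndocs, this yields [[0; 1; 2]; [3]; [4; 5]]; passing ~ndocs:7 adds the singleton [6].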