Version 0.4.0 released / Future plans - 15/07/2008
Version 0.4.0 includes numerous tweaks and patches introduced since 0.2.0.
When your hard drive is cluttered with hundreds of files in folders scattered everywhere, chances are you are using up disk space with multiple copies of the same data. Deduplicator treats files as duplicates only if their content is exactly the same, so it won't find similar image or audio files. This app doesn't delete any files you don't tell it to, but make sure you have a current backup of your data anyway.
- Support for changed crawl.log format that Heritrix introduced in 1.12.0.
- Improved memory usage for large indexes.
- Can now exclude duplicate URIs from new index.
- Various bug fixes.
This will be the last version of the DeDuplicator that is built against Heritrix 1.10.0. Building against that version of Heritrix has made the DeDuplicator compatible with almost all 1.x versions of Heritrix. Note though that 0.4.0 is built with Java 1.5, unlike 0.2.0 which was built with Java 1.4.2.
In version 1.12.0, Heritrix added some useful features that the DeDuplicator should make use of, most notably marking content as 'not novel' (i.e. duplicate). Also, 1.14.0 has rudimentary WARC support, and the aim is to have the DeDuplicator support writing to WARC files. Therefore, any future versions will be built against Heritrix 1.14.0.
Support for Heritrix 2.0 is planned, but there is no set timeframe for it. It requires considerable changes to the DeDuplicator and will likely not be implemented until Heritrix 2.x is mature enough to be used routinely instead of 1.x for large-scale production crawls.
The Record Deduplicator can compare entire records or a subset of fields. Use a subset of fields to focus the comparison on fields of concern. For example, to discard purchases that are accidentally submitted more than once, you might compare information about the purchaser, selected items, and shipping address, but ignore the timestamp of the event.
To enhance pipeline performance, the Record Deduplicator hashes comparison fields and uses the hashed values to evaluate for duplicates. On rare occasions, hash functions can generate collisions that can cause records to be incorrectly treated as duplicates.
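As a rough illustration of the two ideas above (comparing only a subset of fields, and hashing those fields), here is a minimal Python sketch. It is not the Record Deduplicator's implementation: the function names `comparison_hash` and `deduplicate`, the choice of SHA-256, and the JSON serialization are all assumptions made for the example.

```python
import hashlib
import json


def comparison_hash(record, fields=None):
    """Hash the selected fields of a record (all fields when `fields` is None)."""
    selected = record if fields is None else {name: record.get(name) for name in fields}
    # Serialize with sorted keys so that field order does not change the hash.
    payload = json.dumps(selected, sort_keys=True, default=str).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deduplicate(records, fields=None):
    """Split records into (unique, duplicates) using the hashed comparison fields."""
    seen = set()  # only digests are stored, not the records themselves
    unique, duplicates = [], []
    for record in records:
        digest = comparison_hash(record, fields)
        if digest in seen:
            duplicates.append(record)
        else:
            seen.add(digest)
            unique.append(record)
    return unique, duplicates


# Hypothetical purchase records; the second is an accidental resubmission.
purchases = [
    {"purchaser": "ana", "items": ["book"], "address": "1 Main St", "timestamp": "10:00:01"},
    {"purchaser": "ana", "items": ["book"], "address": "1 Main St", "timestamp": "10:00:03"},
    {"purchaser": "bo", "items": ["pen"], "address": "2 Oak Ave", "timestamp": "10:00:05"},
]

unique, duplicates = deduplicate(purchases, fields=["purchaser", "items", "address"])
```

With the timestamp left out of the comparison, the second purchase lands in `duplicates`. Comparing entire records (`fields=None`) would treat all three as unique. Because only digests are kept, two different records whose digests happened to collide would be treated as duplicates, which is the rare case described above.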
The Record Deduplicator caches record information for comparison until it reaches a specified number of records. Then, it discards the information in the cache and starts over.
You can configure a time limit to trigger a cache refresh at regular time intervals. When you configure a time limit, the time limit takes precedence over the record limit.
When you stop the pipeline, the Record Deduplicator discards all information in memory.
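The caching behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `DedupCache`, its parameters, and the injectable `clock` are hypothetical, and "the time limit takes precedence" is read here as "when a time limit is set, only elapsed time triggers a refresh". The actual processor may behave differently.

```python
import time


class DedupCache:
    """Remembers record keys until a record limit or a time limit triggers a refresh."""

    def __init__(self, record_limit, time_limit_seconds=None, clock=time.monotonic):
        self.record_limit = record_limit
        self.time_limit_seconds = time_limit_seconds
        self._clock = clock
        self._seen = set()
        self._started = clock()

    def clear(self):
        """Discard all cached keys, as on a refresh or when the pipeline stops."""
        self._seen.clear()
        self._started = self._clock()

    def _refresh_if_due(self):
        if self.time_limit_seconds is not None:
            # Assumption: with a time limit set, the record limit is not consulted.
            if self._clock() - self._started >= self.time_limit_seconds:
                self.clear()
        elif len(self._seen) >= self.record_limit:
            self.clear()

    def is_duplicate(self, key):
        """Return True if `key` was already seen since the last refresh."""
        self._refresh_if_due()
        if key in self._seen:
            return True
        self._seen.add(key)
        return False


cache = DedupCache(record_limit=2)
results = [cache.is_duplicate(key) for key in ["a", "a", "b", "a"]]
# results == [False, True, False, False]
```

The last `"a"` is not reported as a duplicate because the cache reached its two-record limit and started over. A duplicate is only detected if both copies arrive within the same cache window.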