During processing, an MD5 hash code is generated for each item. This hash code is used to determine whether the item is unique or a duplicate of an item already processed. Currently, duplicates are identified at the Project, Custodian, and Collection levels. The first item processed into a project is treated as the 'unique' item; the second, third, and any later items with a matching MD5 hash code are tagged as duplicates. Project level duplicates are tagged with the System Tag "Duplicate". Custodian level duplicates are tagged with "Custodian Duplicate". Collection level duplicates are tagged with "Collection Duplicate". It is possible for an item to have all three duplicate tags.
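The tagging described above can be sketched as a single pass over items in processing order. This is only an illustration, not the product's implementation: the `Item` class, the `tag_duplicates` function, and the assumption that each item belongs to exactly one custodian and one collection are hypothetical and used here only to show how one item can pick up all three tags.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    # Hypothetical record; the field names are assumptions for illustration.
    md5: str
    custodian: str
    collection: str
    tags: set = field(default_factory=set)


def tag_duplicates(items):
    """Tag items in processing order; the first item seen in a scope stays untagged."""
    seen_project = set()     # hashes seen anywhere in the project
    seen_custodian = set()   # (custodian, hash) pairs seen so far
    seen_collection = set()  # (collection, hash) pairs seen so far

    for item in items:
        if item.md5 in seen_project:
            item.tags.add("Duplicate")
        seen_project.add(item.md5)

        custodian_key = (item.custodian, item.md5)
        if custodian_key in seen_custodian:
            item.tags.add("Custodian Duplicate")
        seen_custodian.add(custodian_key)

        collection_key = (item.collection, item.md5)
        if collection_key in seen_collection:
            item.tags.add("Collection Duplicate")
        seen_collection.add(collection_key)

    return items
```

In this sketch, an item whose hash was already seen in the project, for the same custodian, and in the same collection ends up with all three tags.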
Email items generate their MD5 hash code as outlined below.
*Email Deduplication Criteria:
We concatenate the following field data, then generate an MD5 hash code from the resulting string.
- Created Date (Sent Date/Time)
- From
- To
- CC
- BCC
- Subject
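As a rough illustration of the criteria listed above, the following sketch concatenates the six fields and hashes the result. The field order, the absence of a separator, the UTF-8 encoding, and the dictionary keys are all assumptions made for this example; the article does not specify them, and the actual implementation may normalize values differently.

```python
import hashlib

# Assumed key names and order, mirroring the list above.
EMAIL_FIELDS = ("sent", "from", "to", "cc", "bcc", "subject")


def email_md5(email: dict) -> str:
    """Concatenate the listed fields and return the MD5 hex digest of the string."""
    joined = "".join(email.get(name, "") for name in EMAIL_FIELDS)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

Because only these fields feed the hash in this sketch, two emails that agree on all six values produce the same hash code even if other properties, such as the body text, differ.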
Please note that a parent email and its attachments are grouped and treated as one unit in the de-duplication process. This means that a standalone file, such as a Word document, is not compared to the same Word document attached to an email. This ensures that the email communication, including all of its attachments, is evaluated as a unit. If the same email and attachment group is processed a second time, both the email and its attachments are identified as duplicates.
Again, a Word document attached to an email and a standalone copy of the same Word document would not deduplicate against each other.
Non-Email items generate their MD5 hash code as outlined below.
Non-Email File:
We generate an MD5 hash code from the binary data of the file. This hash code is used to identify any other items with the same hash code. All items that match the 'first in' item are tagged with 'Duplicate'.
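Hashing the binary data of a file can be sketched with Python's standard `hashlib` module, as below. The function name `file_md5` and the chunk size are illustrative choices, not part of the product; reading in chunks simply avoids loading large files into memory at once.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file's raw bytes, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical bytes yield the same hash code regardless of file name or location, which is what allows a later copy to be matched against the 'first in' item.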
*Projects created prior to 08/01/2017 also include the 'Modified Date/Received Date' metadata field value when generating the MD5 hash code.