Deduplicating multiple file types across multiple sources

Our data has been backed up to several external media, and we are now consolidating it. The data comprises binary files, audio, video, compressed archives, virtual machines, databases, etc.

Is it best practice to copy all the files to a single location before deduplicating the data, or is it normal to run the procedure across multiple media?
Is it best to run file-level or block-level deduplication? I am aware of the technical differences but am unclear why you would choose one over the other. We are after accuracy rather than performance.

EDIT

When I say copy, I mean that we would copy each source to a single drive or NAS, with each source represented by its own directory. All of the data is currently stored on external hard drives. The objective is to deduplicate the data and end up with a single source of truth.

deduplication

— Motivated

The paid version of CCleaner can detect duplicate files. I don't know if it scans network drive locations. Your actual question is not all that clear.

— Ramhound

How would you copy it all to a single source? Are you talking about a single drive instead of several network locations? Are you talking about one folder vs. multiple folders? What about a mix of HDDs, SSDs and other removable media? Please clarify.

— Raystafarian

Tools like rsync can manage the comparison operations and the moving of bits back and forth, but you're going to have to supply your own logic about which version of the data is canonical.

Is it best to run file-level or block-level deduplication?

This part of your question is easy, at least: you should never need to care about what is going on at the block level.

— Robert Calhoun
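
To make "supply your own logic about which version is canonical" concrete, here is a minimal Python sketch of file-level deduplication across several source directories. It is not taken from the answer above and makes a few assumptions: each backup medium has been copied into its own directory, files count as duplicates when their size and SHA-256 digest match, and the canonical copy is the one under the earliest-listed directory. The names sha256_of, find_duplicates and pick_canonical are hypothetical helpers, and the priority rule is only one possible choice.

```python
import hashlib
import os
from collections import defaultdict


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large VM images or archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots):
    """Return groups of paths that share both size and SHA-256 digest."""
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    by_size[os.path.getsize(path)].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[sha256_of(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups


def pick_canonical(group, roots):
    """Assumed rule: keep the copy that lives under the earliest-listed root."""
    absolute_roots = [os.path.abspath(root) + os.sep for root in roots]

    def rank(path):
        absolute = os.path.abspath(path)
        for index, root in enumerate(absolute_roots):
            if absolute.startswith(root):
                return index
        return len(absolute_roots)  # paths outside every root sort last

    return min(group, key=lambda p: (rank(p), p))
```

Since accuracy matters more than speed here, a final byte-for-byte comparison of each group (for example with filecmp.cmp(a, b, shallow=False) from the standard library) could be run before anything is deleted.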

My understanding is that rsync uses file-level checksums to verify duplicate data. How accurate is this? Why use rsync over opendedup, for example? I assume that it's better to use block-level deduplication.

— Motivated
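
For readers weighing the two approaches, the following Python sketch illustrates what block-level deduplication measures that file-level deduplication does not. It is only an illustration under stated assumptions: files are split into fixed-size 4096-byte blocks and each block is identified by its SHA-256 digest. Real block-level systems may chunk data differently, and block_stats is a hypothetical helper, not part of rsync or opendedup.

```python
import hashlib


def block_stats(paths, block_size=4096):
    """Count total and unique fixed-size blocks across the given files."""
    seen = set()
    total = 0
    for path in paths:
        with open(path, "rb") as handle:
            # Read the file one block at a time; the last block may be shorter.
            for block in iter(lambda: handle.read(block_size), b""):
                total += 1
                seen.add(hashlib.sha256(block).digest())
    return total, len(seen)
```

Two files that share most of their blocks but differ in one would be treated as entirely distinct by a file-level comparison, whereas the block counts above would show the shared portion. Either way, the question of which file to keep is still decided per file.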

By default rsync only compares file size and modification time. If you use the --checksum option (which I would recommend, as it doesn't take that long), it will compute a checksum of each file as it goes. I am not sure what algorithm rsync uses; probably MD5. MD5 is no longer considered cryptographically secure, but it's fine for declaring two files to be identical (with very high probability). I am not familiar with opendedup. Rsync is powerful but notoriously difficult to use; there certainly might be more appropriate tools out there.

— Robert Calhoun
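
The difference between the two comparison modes described in the comment above can be sketched in Python. This is not rsync's implementation; it only mimics the idea of a quick check (size and modification time) versus a content checksum. SHA-256 is used here as an assumption in place of whatever algorithm rsync actually negotiates, and quick_check_same, file_digest and checksum_same are hypothetical names.

```python
import hashlib
import os


def quick_check_same(path_a, path_b):
    """Compare size and modification time only; file contents are never read."""
    stat_a, stat_b = os.stat(path_a), os.stat(path_b)
    return (stat_a.st_size == stat_b.st_size
            and int(stat_a.st_mtime) == int(stat_b.st_mtime))


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_same(path_a, path_b):
    """Compare sizes first (cheap), then full-content digests."""
    return (os.path.getsize(path_a) == os.path.getsize(path_b)
            and file_digest(path_a) == file_digest(path_b))
```

Two files with the same size and timestamp but different contents pass the quick check and fail the checksum comparison, which is why the checksum route is the safer one when accuracy is the priority.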

How does that differ from standard file-level deduplication?

— Motivated

It differs in that you do not have to actually transfer the file over the network to do the comparison. Re-reading your question, I see that your data is on external hard drives, so you don't care about this feature of rsync and it is probably not the right tool. Sorry.

— Robert Calhoun