What do you mean by “deduplicate”?

July 11, 2020

Browsing with PhotoStructure is designed to be fast and fun.

If your library contains duplicate photos or videos, clicking “next” or “previous” can show you the same thing again. But wait: did the click not register? Is this a bug? Either way, these browsing stutters aren’t fun.

To avoid this, PhotoStructure automatically detects duplicate photo and video variations, and only shows you the “best” variant.

Why you may have duplicates

There are several reasons why you might have 2 or more copies or variations of any given photo or video:

RAW+JPEG pairs

Most current digital cameras and even some smartphones support “shooting raw.”

These raw files preserve more of the sensor’s data than JPEGs do. This additional information lets you “post-process” files to get better dynamic range, restore highlight and shadow details, and adjust color balance, with much more flexibility than a JPEG allows.

Unfortunately, raw images are slow to process, and many image applications can’t handle these files. Most cameras allow shooting in “RAW+JPEG,” where each time you push the shutter button, both a JPEG file and a raw image file are written to your memory card. If PhotoStructure didn’t know that these are actually the same image, you’d see two (or more) photos with the same image while browsing your PhotoStructure library.

Cloud backups

Several photo cloud backup services downsample your photos and videos and strip much of the metadata from your files.

If you download a local backup from your cloud service, these photos and videos will be near-duplicates of your original files: the same image, but not the same bytes.

Local edits

When you make edits to your images, some software will write to a new file rather than overwriting your original.

Local backups

If you’ve used backup software, you’ll have additional copies of your photos and videos wherever the backup destination was configured.

How this relates to automatic organization

If you’ve enabled automatic organization, PhotoStructure errs on the side of caution, and copies each valid, unique image into your library.

If exactly the same file is found (i.e., precisely the same stream of bytes on disk), it won’t be copied into your library again. All other variants of the image, though, will be copied.

As an example, in the above cases, both the raw and JPEG files will be copied into your library, as well as any unique files from cloud service backups, and local edits.
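The “precisely the same stream of bytes” rule can be sketched with a content hash. This is only an illustration, not PhotoStructure’s actual implementation: the choice of SHA-256 and the helper names `file_digest` and `files_to_copy` are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def files_to_copy(candidates: list[Path], library_digests: set[str]) -> list[Path]:
    """Keep only candidates whose bytes are not already in the library.

    `library_digests` is a hypothetical set of digests for files that
    were copied earlier.
    """
    seen = set(library_digests)
    keep = []
    for path in candidates:
        digest = file_digest(path)
        if digest in seen:
            continue  # byte-for-byte duplicate: skip it
        seen.add(digest)
        keep.append(path)
    return keep
```

Note that a raw file and its JPEG sibling, or a downsampled cloud copy, have different bytes and therefore different digests, so under this rule all of them would be kept.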

How files are aggregated

PhotoStructure examines a number of metadata tags in each file. If two files both have a value for a given tag, and those values substantively differ, the files are considered to be different assets.

If the captured-at time matches, but an insufficient number of other metadata tags match, PhotoStructure will compare the actual images of the files. If they are substantively different, the files are considered to be different assets.

You can use the info tool to compare files and see if PhotoStructure considers them eligible to be associated with the same asset.
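The two checks described in this section can be sketched roughly as follows. Everything here is an assumption for illustration: the tag names in `COMPARED_TAGS`, the `min_matching_tags` threshold, and the use of plain inequality as a stand-in for “substantively differ” are not taken from PhotoStructure.

```python
from typing import Mapping, Optional

Tags = Mapping[str, Optional[str]]

# Hypothetical tag names, chosen only for illustration.
COMPARED_TAGS = ("captured_at", "camera_model", "lens", "exposure_time")


def tags_conflict(a: Tags, b: Tags) -> bool:
    """True if any tag is present in both files with different values."""
    for tag in COMPARED_TAGS:
        va, vb = a.get(tag), b.get(tag)
        # A tag missing from either file is not evidence of a difference.
        if va is not None and vb is not None and va != vb:
            return True
    return False


def needs_image_comparison(a: Tags, b: Tags, min_matching_tags: int = 2) -> bool:
    """True if captured-at matches but too few other tags agree."""
    captured_at = a.get("captured_at")
    if captured_at is None or captured_at != b.get("captured_at"):
        return False
    matches = sum(
        1
        for tag in COMPARED_TAGS
        if tag != "captured_at"
        and a.get(tag) is not None
        and a.get(tag) == b.get(tag)
    )
    return matches < min_matching_tags
```

For example, a JPEG whose lens and exposure tags were stripped does not conflict with its raw sibling, but it would fall through to the image comparison step because too few tags are left to match on.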

How does PhotoStructure pick which file to show?

In general, PhotoStructure picks as “best” the image or video variation with the highest resolution that lives in your library.

To make PhotoStructure’s “best” pick predictable, though, there are a number of other file metadata attributes that PhotoStructure also uses. The variantSortCriteria library setting allows you to customize how PhotoStructure picks your library’s “best”.

Here’s the list of those fields, in default priority order, as of v2023:

resolution: the coarse image resolution. Variations with similar megapixel counts are considered equivalent.
schemeIdx: captures “where the file resides” (it references the asset file URI scheme). This prefers files stored in your library over files found outside your library.
capturedAtPrecision: variations that contain more reliable captured-at metadata will be preferred.
metadataCoverage: prefer files that have values for more of the metadata fields we care about.
isBrowserSupported: prefer files we can directly stream to the browser without re-rendering or transcoding.
isEditOrUpdate: prefer files whose basename includes “edit” or “update”. Many editing applications will save “file-updated.jpg” instead of overwriting the original file.
isCover: if we have burst files, prefer the “burst cover”.
count: if there are many copies of a file (image.jpg, image (1).jpg, image (2).jpg), prefer the one with the highest number (assuming that’s the latest copy).
mtime: prefer the newest version. Note that many backup applications don’t retain mtime correctly, so we don’t really trust this value.
basename: this helps make sorting deterministic if all other factors are the same
parentBasename: this helps make sorting deterministic
uri: this is simply to make sorting deterministic if all other factors are the same
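A priority-ordered list like this behaves like a multi-key sort: each field is only consulted when every earlier field ties. Here is a minimal sketch using a handful of the fields above. The `Variant` class, its attribute names, the rounding used for “coarse” resolution, the ascending order of the tie-breakers, and the example URIs are all assumptions for illustration, not PhotoStructure’s code.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    uri: str
    megapixels: float
    in_library: bool
    basename: str


def coarse_resolution(megapixels: float) -> int:
    """Bucket resolutions so that similar megapixel counts compare equal."""
    return round(megapixels)


def copy_count(basename: str) -> int:
    """Extract N from names like 'image (N).jpg'; 0 if there is no suffix."""
    m = re.search(r"\((\d+)\)(?=\.[^.]+$)", basename)
    return int(m.group(1)) if m else 0


def sort_key(v: Variant):
    # Earlier tuple entries win; later entries only break ties.
    # Negation makes larger values sort first.
    return (
        -coarse_resolution(v.megapixels),  # resolution
        0 if v.in_library else 1,          # schemeIdx (simplified)
        -copy_count(v.basename),           # count
        v.basename,                        # deterministic tie-breaker
        v.uri,                             # deterministic tie-breaker
    )


def pick_best(variants: list[Variant]) -> Variant:
    """Return the variant that sorts first under sort_key."""
    return min(variants, key=sort_key)


external = Variant("file:///backup/image.jpg", 24.2, False, "image.jpg")
in_lib = Variant("library:/2020/image.jpg", 24.4, True, "image.jpg")
in_lib_copy = Variant("library:/2020/image (1).jpg", 24.0, True, "image (1).jpg")
small = Variant("library:/2020/small.jpg", 12.0, True, "small.jpg")

best = pick_best([external, in_lib, in_lib_copy, small])
```

In this example the three roughly 24-megapixel files tie on coarse resolution, the two library files beat the external one, and the “(1)” copy wins on count, so `best` is `in_lib_copy`. The final basename and uri keys exist only so that two otherwise identical variants always sort the same way.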


Hi! I’m Matthew, the author of PhotoStructure. I’m a dad, amateur photographer, and author of open source, adtech, fintech, and edtech software since the turn of the millennium. He/him. I am building a self-hosted, safe, and fast new home for all your photos and videos.