Finding Duplicate Files on My Laptop with PowerShell Because That Is the Tool I Actually Know - Packet Flow: Journey to Network & Cybersecurity Expertise

I was cleaning up files on my laptop and ran into a familiar problem: duplicate files.

Nothing dramatic. No emergency. No incident response plan. No dashboard glowing red like it wants attention.

Just files.

The same document in one folder.

Another copy somewhere else.

Another one sitting around because at some point I probably thought, “I’ll clean this up later.”

Of course I did not.

So I wrote a PowerShell script to find exact duplicate files and write the results to a tab-delimited file.

That is the whole job.

The script does not delete files. It does not move files. It does not rename anything. It does not pretend to know which copy matters.

Good.

I wanted a report, not a digital wood chipper.

The script ran and created the logs. That is already a decent first version.

Why Ubuntu?

Because this laptop runs full Ubuntu.

Not WSL.

Not Windows pretending to be Linux.

Not a dual-boot TED Talk.

Full Ubuntu.

I also have PowerShell installed on it. Microsoft has official documentation for installing PowerShell on Ubuntu, so this is not some weird back-alley trick. It is supported and documented (Microsoft, 2026).

A lot of PowerShell examples still assume Windows paths like this:

C:\Users\Teo\Documents

That is not what I am using on this laptop.

My paths look more like this:

/home/support/Documents
/home/support/Downloads
/

So the script needed to work with Linux paths because that is where my files are.

That is really the whole explanation. I am using Ubuntu, so the script should work on Ubuntu.

Wild idea. Write the script for the computer in front of you.

Why PowerShell and Not Bash or Python?

Yes, I know.

Somebody will ask, “Why not Bash?”

Somebody else will ask, “Why not Python?”

And then somebody will recommend some command-line tool with a name that sounds like a robot sneezed.

Fair enough.

The answer is simple: I know PowerShell better.

That is it.

Bash can do this. Python can do this. A lot of tools can do this. I am just still green on Bash, and I am still working on Python.

Yes, still green there too.

I used PowerShell because I could actually build the thing in PowerShell without turning a duplicate-file cleanup script into a month-long identity crisis.

PowerShell also handles objects nicely. That helps. File name, path, size, hash, creation time, modified time: those can all be treated as properties. Then I can group them, sort them, and write them to a log without spending my afternoon doing text parsing gymnastics.

I used the tool I know best right now.

A shocking scandal. A person used a familiar tool to solve a real problem.

Also, I just love Learn PowerShell in a Month of Lunches, Fourth Edition: Covers Windows, Linux, and macOS by Travis Plunk, James Petty, and Tyler Leonhardt.

No, they did not sponsor this post.

Would it be cool if they saw it? Absolutely.

That book helped make PowerShell feel practical to me. Not fancy. Not mysterious. Practical.

I also like Mastering PowerShell Scripting: Automate and Manage Your Environment Using PowerShell 7.1, Fourth Edition by Chris Dent.

Also not sponsored.

Again, would be cool if they saw it.

Those books are part of why I keep coming back to PowerShell. I know I still need to get better at Bash and Python. I am working on it. Slowly. With occasional muttering at the screen.

For this project, PowerShell was the tool I could use today.

That mattered more than winning imaginary internet points.

What the Script Does

The script scans a folder, multiple folders, or the full Ubuntu filesystem.

Examples:

pwsh ./Find-DuplicateFiles.ps1 -ScanPath "/home/support"

pwsh ./Find-DuplicateFiles.ps1 -ScanPath "/home/support/Documents"

pwsh ./Find-DuplicateFiles.ps1 -ScanPath "/"

It then creates three output files:

DuplicateFiles-YYYYMMDD-HHMMSS.tsv
DuplicateFiles-Errors-YYYYMMDD-HHMMSS.tsv
DuplicateFiles-Summary-YYYYMMDD-HHMMSS.txt

The main file is the TSV report. TSV means tab-separated values. Excel and LibreOffice Calc can open it easily, which is good because I want to review the results, not fight the file format.

The error file lists folders or files the script could not access.

The summary file shows the basic numbers: files found, files checked, files hashed, duplicate groups found, and where the logs were saved.

Not glamorous.

Useful.

I like useful.
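The post does not show how the error file gets its rows, so here is a minimal sketch of one way it could work, assuming the scan uses Get-ChildItem. The names $ScanPath, $ScanErrors, and $ErrorRows are made up for illustration, and the sketch points at a path that does not exist just to force an error.

```powershell
# Hypothetical sketch: collect enumeration errors instead of letting them stop the scan.
# $ScanPath, $ScanErrors, and $ErrorRows are illustration names, not from the original script.
$ScanPath = Join-Path ([System.IO.Path]::GetTempPath()) "does-not-exist-$(New-Guid)"

# SilentlyContinue keeps the console quiet; ErrorVariable still records each error.
$ScanParams = @{
    LiteralPath   = $ScanPath
    File          = $true
    Recurse       = $true
    ErrorAction   = 'SilentlyContinue'
    ErrorVariable = 'ScanErrors'
}
$Files = @(Get-ChildItem @ScanParams)

# Turn each error record into a row that could later be written to the error TSV.
$ErrorRows = @(foreach ($Err in $ScanErrors) {
    [PSCustomObject]@{
        Target  = $Err.TargetObject
        Message = $Err.Exception.Message
    }
})

"Files found: $($Files.Count)"
"Errors logged: $($ErrorRows.Count)"
```

The design choice here is to keep scanning and record the problem, rather than stopping at the first folder that says no.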

The Basic Setup

Before anything useful happens, the script creates a timestamp and builds the log file names.

That keeps each scan separate. It also keeps me from overwriting yesterday’s results because I got excited and ran the script again like a raccoon pressing buttons.

$TimeStamp = Get-Date -Format "yyyyMMdd-HHmmss"

$DuplicateLog = Join-Path $LogFolder "DuplicateFiles-$TimeStamp.tsv"
$ErrorLog     = Join-Path $LogFolder "DuplicateFiles-Errors-$TimeStamp.tsv"
$SummaryTxt   = Join-Path $LogFolder "DuplicateFiles-Summary-$TimeStamp.txt"

That gives me output files like this:

DuplicateFiles-20260614-120120.tsv
DuplicateFiles-Errors-20260614-120120.tsv
DuplicateFiles-Summary-20260614-120120.txt

Simple. Clean. Traceable.

Good enough for my laptop. Good enough for later review.

What Counts as a Duplicate?

The script uses two checks:

same file size
same file hash

Both must match.

A matching filename is not enough. A matching date is not enough. A matching folder does not matter. A vague memory that I saw the file before is also not enough, although that is usually how my cleanup efforts begin.

The script looks for exact duplicate files. That means the contents are the same.

Why Start with File Size?

File size is quick to check.

If one file is 50 KB and another file is 12 MB, they are not exact duplicates. No need to turn that into a courtroom drama.

The script first groups files by size:

$CandidateSizeGroups = $FilesToCheck |
    Group-Object -Property Length |
    Where-Object { $_.Count -gt 1 }

In plain English:

Put files with the same size together.
Keep only groups that have more than one file.
Ignore files with unique sizes.

This matters because files with unique sizes cannot be exact duplicates. No matching size, no need to hash.

This saves time. Checking a size only reads file metadata, but hashing means the script has to read the full contents of the file. That is fine for a few files. It is less fun when the laptop starts going through years of PDFs, photos, installers, exports, and other digital clutter I apparently kept for future archaeologists.

File size is the first filter.

It does not prove duplication.

It just tells the script which files deserve a closer look.
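To see the size filter on its own, here is a small sketch that runs the same grouping against stand-in objects. The file names and sizes are invented for the example; a real run would feed in FileInfo objects from the scan.

```powershell
# Stand-in objects for illustration only; names and sizes are made up.
$FilesToCheck = @(
    [PSCustomObject]@{ Name = 'report.pdf';      Length = 845120 }
    [PSCustomObject]@{ Name = 'report-copy.pdf'; Length = 845120 }
    [PSCustomObject]@{ Name = 'notes.txt';       Length = 51200 }
    [PSCustomObject]@{ Name = 'installer.iso';   Length = 12582912 }
)

# Same pipeline as in the post: group by size, keep groups with more than one file.
$CandidateSizeGroups = $FilesToCheck |
    Group-Object -Property Length |
    Where-Object { $_.Count -gt 1 }

# Only the two 845,120-byte files remain as hashing candidates.
$CandidateSizeGroups | ForEach-Object { $_.Group.Name }
```

Four files go in, two come out as candidates. The other two never get hashed.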

What the Hash Does

After the script finds files with the same size, it calculates a hash for those files:

$Hash = Get-FileHash -LiteralPath $File.FullName -Algorithm $HashAlgorithm -ErrorAction Stop

PowerShell’s Get-FileHash calculates a hash value for a file using the selected hash algorithm, which makes it useful for checking file contents instead of trusting filenames (Microsoft, n.d.-a).

A hash is like a fingerprint for the file’s contents.

Rename the file? The hash stays the same.

Move the file? The hash stays the same.

Change the file? The hash changes.

That is why the hash matters.
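The rename and edit behavior is easy to check with a throwaway file. This is a minimal sketch, assuming a temp folder is writable; the folder and file names are invented for the demo.

```powershell
# Work in a throwaway folder so nothing real is touched.
$Dir = Join-Path ([System.IO.Path]::GetTempPath()) "hash-demo-$(New-Guid)"
New-Item -ItemType Directory -Path $Dir | Out-Null

$Original = Join-Path $Dir 'report.txt'
Set-Content -LiteralPath $Original -Value 'quarterly numbers'
$HashBefore = (Get-FileHash -LiteralPath $Original -Algorithm SHA256).Hash

# Rename: the contents are untouched, so the hash should be identical.
$Renamed = Join-Path $Dir 'report-final.txt'
Rename-Item -LiteralPath $Original -NewName 'report-final.txt'
$HashAfterRename = (Get-FileHash -LiteralPath $Renamed -Algorithm SHA256).Hash

# Edit: appending a line changes the contents, so the hash should differ.
Add-Content -LiteralPath $Renamed -Value 'one more line'
$HashAfterEdit = (Get-FileHash -LiteralPath $Renamed -Algorithm SHA256).Hash

"Same after rename: $($HashBefore -eq $HashAfterRename)"
"Same after edit:   $($HashBefore -eq $HashAfterEdit)"

Remove-Item -LiteralPath $Dir -Recurse -Force
```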

The script uses SHA256 by default. SHA256 is part of the Secure Hash Standard published by NIST, which specifies hash algorithms used to generate message digests and detect whether data has changed (National Institute of Standards and Technology [NIST], 2015).

When the hash is calculated, the script stores the file details in an object:

$HashedFiles.Add([PSCustomObject]@{
    FileName      = $File.Name
    Location      = $File.DirectoryName
    FullPath      = $File.FullName
    SizeBytes     = $File.Length
    SizeMB        = [Math]::Round(($File.Length / 1MB), 2)
    HashAlgorithm = $HashAlgorithm
    Hash          = $Hash.Hash
    CreationTime  = $File.CreationTime
    LastWriteTime = $File.LastWriteTime
}) | Out-Null

That object is the useful part.

It gives me structured data. Not a blob of terminal text. Not a pile of strings. Real fields I can group, sort, and write to a TSV.

So if these two files have the same size and the same SHA256 hash:

/home/support/Documents/report.pdf
/home/support/Desktop/report-copy.pdf

the script treats them as duplicates.

Different path. Maybe different filename. Same content.

That is the part I care about.

The Actual Duplicate Check

This is the part that decides which files are duplicates:

$DuplicateGroups = $HashedFiles |
    Group-Object -Property SizeBytes, Hash |
    Where-Object { $_.Count -gt 1 } |
    Sort-Object Count -Descending

In plain English:

Group files by size and hash.
If a group has more than one file, list it as a duplicate group.

Example:

File A
Name: contract.pdf
Size: 845,120 bytes
Hash: AAA111

File B
Name: contract-copy.pdf
Size: 845,120 bytes
Hash: AAA111

File C
Name: contract-old.pdf
Size: 845,120 bytes
Hash: BBB222

File A and File B match.

File C does not.

Even though File C has the same size, the hash is different. That means the contents are different.

Similar is not duplicate.

Close is not duplicate.

Exact is duplicate.
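The File A, File B, File C example can be run as is. The sketch below uses stand-in objects with the same placeholder hashes (AAA111 and BBB222 are not real SHA256 values) and the same grouping step.

```powershell
# Stand-in rows matching the example above; the hashes are placeholders.
$HashedFiles = @(
    [PSCustomObject]@{ FileName = 'contract.pdf';      SizeBytes = 845120; Hash = 'AAA111' }
    [PSCustomObject]@{ FileName = 'contract-copy.pdf'; SizeBytes = 845120; Hash = 'AAA111' }
    [PSCustomObject]@{ FileName = 'contract-old.pdf';  SizeBytes = 845120; Hash = 'BBB222' }
)

# Group on both properties; a group of one is not a duplicate set.
$DuplicateGroups = $HashedFiles |
    Group-Object -Property SizeBytes, Hash |
    Where-Object { $_.Count -gt 1 }

foreach ($Group in $DuplicateGroups) {
    "Duplicate set: $($Group.Group.FileName -join ', ')"
}
```

Only contract.pdf and contract-copy.pdf end up in a duplicate set. contract-old.pdf has the same size but sits alone in its own group, so the Where-Object filter drops it.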

Building the Duplicate Report

After the duplicate groups are found, the script assigns each group a number.

That number matters. It lets me sort the TSV and see which files belong together.

# Generic list that collects one report row per duplicate file
$DuplicateResults = [System.Collections.Generic.List[object]]::new()
$DuplicateGroupNumber = 1

foreach ($Group in $DuplicateGroups) {
    # Sort inside the group so copies in the same folder sit next to each other
    $FilesInGroup = $Group.Group | Sort-Object Location, FileName

    foreach ($File in $FilesInGroup) {
        # List[object].Add() returns nothing, so no Out-Null is needed here
        $DuplicateResults.Add([PSCustomObject]@{
            DuplicateGroup = $DuplicateGroupNumber
            FileName       = $File.FileName
            Location       = $File.Location
            FullPath       = $File.FullPath
            SizeBytes      = $File.SizeBytes
            SizeMB         = $File.SizeMB
            CreationTime   = $File.CreationTime
            LastWriteTime  = $File.LastWriteTime
            HashAlgorithm  = $File.HashAlgorithm
            Hash           = $File.Hash
        })
    }

    # Every file in this group shares the number; the next group gets the next one
    $DuplicateGroupNumber++
}

This is where the report becomes usable.

Without the duplicate group number, the report would still technically list duplicates, but reviewing it would be more annoying. And I am trying to reduce annoying. I have enough of that built in.

Writing the TSV

The script writes a TSV file with these fields:

duplicate group
filename
location
full path
file size bytes
file size MB
date created
date modified
hash algorithm
hash

The header is created like this:

$DuplicateLines.Add(
    "duplicate group`tfilename`tlocation`tfull path`tfile size bytes`tfile size MB`tdate created`tdate modified`thash algorithm`thash"
)

Each duplicate file gets written as one tab-separated line:

foreach ($File in $DuplicateResults) {
    $DuplicateLines.Add(
        "$(Format-ForTsv $File.DuplicateGroup)`t" +
        "$(Format-ForTsv $File.FileName)`t" +
        "$(Format-ForTsv $File.Location)`t" +
        "$(Format-ForTsv $File.FullPath)`t" +
        "$(Format-ForTsv $File.SizeBytes)`t" +
        "$(Format-ForTsv $File.SizeMB)`t" +
        "$(Format-DateForTsv $File.CreationTime)`t" +
        "$(Format-DateForTsv $File.LastWriteTime)`t" +
        "$(Format-ForTsv $File.HashAlgorithm)`t" +
        "$(Format-ForTsv $File.Hash)"
    )
}

$DuplicateLines | Out-File -FilePath $DuplicateLog -Encoding UTF8
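The loop above calls Format-ForTsv and Format-DateForTsv, but their bodies are not shown in the post. What follows is only a guess at what helpers like that might do: keep stray tabs and line breaks out of the columns, and write dates in one fixed format. Treat both functions as hypothetical stand-ins, not the original code.

```powershell
# Hypothetical helpers; the originals are not shown in the post.
function Format-ForTsv {
    param([object]$Value)
    if ($null -eq $Value) { return '' }
    # Tabs and line breaks inside a value would break the column layout, so swap them for a space.
    return ([string]$Value) -replace '[\t\r\n]+', ' '
}

function Format-DateForTsv {
    param([object]$Value)
    if ($Value -is [datetime]) {
        # Fixed, sortable format that does not depend on the machine's culture settings.
        return $Value.ToString('yyyy-MM-dd HH:mm:ss', [cultureinfo]::InvariantCulture)
    }
    return Format-ForTsv $Value
}

Format-ForTsv "weird`tname.pdf"
Format-DateForTsv ([datetime]::new(2026, 6, 14, 12, 1, 20))
```

Replacing the tab with a space is one choice among several. Another option would be to reject or flag such file names, which keeps the report closer to what is on disk.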

The most useful field is duplicate group.

Every file with the same duplicate group number matches the other files in that group.

Example:

duplicate group    filename              location
1                  report.pdf            /home/support/Documents
1                  report-copy.pdf       /home/support/Desktop
2                  invoice.pdf           /home/support/Downloads
2                  invoice-backup.pdf    /home/support/Documents/Archive

Group 1 is one duplicate set.

Group 2 is another duplicate set.

That makes review easier. Sort by duplicate group, look at each set, decide what to keep.

No guessing. Less squinting. Fewer “wait, what am I looking at?” moments.
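A spreadsheet is not the only way to review the report. As a rough sketch, the TSV can be read back into PowerShell with Import-Csv and a tab delimiter. The sample file below is generated inside the sketch so it runs on its own, and it only carries the first three columns.

```powershell
# Build a tiny sample report so the sketch is self-contained.
$SampleTsv = Join-Path ([System.IO.Path]::GetTempPath()) "DuplicateFiles-sample-$(New-Guid).tsv"
@(
    "duplicate group`tfilename`tlocation"
    "1`treport.pdf`t/home/support/Documents"
    "1`treport-copy.pdf`t/home/support/Desktop"
    "2`tinvoice.pdf`t/home/support/Downloads"
    "2`tinvoice-backup.pdf`t/home/support/Documents/Archive"
) | Out-File -FilePath $SampleTsv -Encoding UTF8

# Import-Csv reads tab-separated files when given a tab delimiter.
$Rows = @(Import-Csv -LiteralPath $SampleTsv -Delimiter "`t")

# Column names with spaces need quotes when used as property names.
$Sets = @($Rows | Group-Object -Property 'duplicate group')
foreach ($Set in $Sets) {
    "Group $($Set.Name): $($Set.Count) copies"
}

Remove-Item -LiteralPath $SampleTsv
```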

Finished scan. The logs are written. The file counts also explain why this did not finish in twelve seconds.

Yes, It Is Slow

Let me say this now before anyone thinks I am pretending this thing is optimized.

Yes, the script is slow.

I know.

I have not optimized it.

I do not have the slightest idea how to optimize it properly yet. That is not false modesty. That is just where I am with it.

Right now, the script works. Next, I need to learn how to make it better.

The slow part is mostly the hashing. When the script hashes a file, it has to read the file contents. That is fine for small files. It gets less charming when it starts chewing through large PDFs, videos, ISO files, backups, and whatever else I kept because “I might need this later.”

Sure, future me. Sure.

Right now, I built it for correctness first.

Speed can come later.

Some things I want to read more about:

better filtering before hashing
parallel processing
excluding large folders
limiting by file type
faster file enumeration
saving previous hash results
scanning only changed files

Do I know how to do all of that cleanly today?

No.

Am I going to start reading on it?

Yes.

That is part of the fun. Build something. Run it. Notice what is ugly. Learn why. Improve it.

Annoying? Yes.

Learning? Also yes.
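As a starting point for the parallel processing item on that list, here is a minimal sketch using ForEach-Object -Parallel, which exists in PowerShell 7 and later. It is an untuned illustration, not a measured improvement: the throttle limit of 4 is an arbitrary guess, the demo files are generated in a temp folder, and hashing is often limited by disk speed, so the real gain may be small.

```powershell
# Requires PowerShell 7 or later; ForEach-Object -Parallel is not in Windows PowerShell 5.1.
$Dir = Join-Path ([System.IO.Path]::GetTempPath()) "parallel-demo-$(New-Guid)"
New-Item -ItemType Directory -Path $Dir | Out-Null

# Six demo files: 1-3 share one body, 4-6 share another.
1..6 | ForEach-Object {
    $Body = if ($_ -le 3) { 'same content' } else { 'other content' }
    Set-Content -LiteralPath (Join-Path $Dir "file$_.txt") -Value $Body
}

$Files = Get-ChildItem -LiteralPath $Dir -File

# Hash several files at once. ThrottleLimit 4 is an arbitrary guess, not a tuned value.
$Hashed = $Files | ForEach-Object -Parallel {
    $Hash = Get-FileHash -LiteralPath $_.FullName -Algorithm SHA256
    [PSCustomObject]@{ FullPath = $_.FullName; SizeBytes = $_.Length; Hash = $Hash.Hash }
} -ThrottleLimit 4

# Same grouping step as before; the order of $Hashed may vary between runs.
$Groups = @($Hashed | Group-Object -Property SizeBytes, Hash | Where-Object { $_.Count -gt 1 })
"Duplicate groups found: $($Groups.Count)"

Remove-Item -LiteralPath $Dir -Recurse -Force
```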

Why Dates Are Logged but Not Used

The TSV includes date created and date modified.

Those dates are useful when I review the results. Maybe I want to keep the newest copy. Maybe I want the one in the better folder. Maybe I want the one that looks like it belongs to a real filing system and not a panic backup from 2022.

But dates are not used to prove duplication.

A file can be copied and get a new creation date. A restored file can have odd timestamps. A file can be moved around and still be the same file. The LastWriteTime property can show when a file or directory was last written to, but that does not make it a duplicate detector by itself (Microsoft, n.d.-b).

Dates help with review.

They do not prove content.
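A quick way to see this is to copy a file, change only the copy's modified date, and compare. This is a small sketch with invented file names in a temp folder.

```powershell
$Dir = Join-Path ([System.IO.Path]::GetTempPath()) "date-demo-$(New-Guid)"
New-Item -ItemType Directory -Path $Dir | Out-Null

$Original = Join-Path $Dir 'budget.txt'
$Copy     = Join-Path $Dir 'budget-copy.txt'
Set-Content -LiteralPath $Original -Value 'same numbers'
Copy-Item -LiteralPath $Original -Destination $Copy

# Push the copy's modified date back a month without touching its contents.
(Get-Item -LiteralPath $Copy).LastWriteTime = (Get-Date).AddDays(-30)

$SameDate = (Get-Item -LiteralPath $Original).LastWriteTime -eq (Get-Item -LiteralPath $Copy).LastWriteTime
$SameHash = (Get-FileHash -LiteralPath $Original -Algorithm SHA256).Hash -eq
            (Get-FileHash -LiteralPath $Copy -Algorithm SHA256).Hash

"Dates match:  $SameDate"
"Hashes match: $SameHash"

Remove-Item -LiteralPath $Dir -Recurse -Force
```

The dates disagree and the hashes agree, which is why the date columns are there for review and not for matching.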

Why File Names Are Not Enough

File names are useful, but they are not proof.

These could be different files:

budget.xlsx
budget.xlsx

These could be the same file:

budget.xlsx
budget_revised_final_actual_final.xlsx

The name is just a label. Sometimes it helps. Sometimes it is a tiny confession that the file went through too many versions.

So the script records the filename in the report, but it does not use the filename to decide whether two files are duplicates.

The filename is for review.

The hash is for proof.

About the Progress Bar That Still Hates Me

I wanted a progress bar.

Still do.

I have been trying to get a decent progress bar working since I wrote code for automating a daily backup and database restore.

And guess what?

It still has not worked the way I want.

Wonderful. Character development, but with more terminal output.

The idea sounds simple enough: show progress while the script runs.

How hard could that be?

Then the script reminds me that “sounds simple” is usually where the trouble starts.

The problem is that progress bars need something useful to measure. That is easy when you know the total number of things ahead of time. It is less easy when the script is still discovering files, walking folders, skipping system paths, hitting permission issues, and then hashing only the files that might actually be duplicates.

So yes, I wanted a real progress bar.

No, I do not have one working cleanly yet.

Right now, the script gives me basic status output. It shows what it is doing and which path it is scanning. Not pretty. Not fancy. Not the dashboard of my dreams. But at least I can tell the thing is alive.

Something like:

Scanning target: / [Ubuntu]
Current path: /home/support/Documents
Finding files with matching sizes...
Calculating file hashes...
Grouping duplicate files...

That is not a real progress bar.

That is more like the script yelling from the other room, “Still working!”

Fine. I will take it.
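One place where the total is known ahead of time is the hashing step, because the list of same-size candidates already exists by then. The sketch below shows how Write-Progress could be driven from that count. It is a bare illustration with stand-in work items and a short sleep in place of real hashing, not a fix for the full scan.

```powershell
# Stand-in work items; in a real run this would be the list of same-size files.
$Candidates = 1..20
$Done = 0

foreach ($Item in $Candidates) {
    # Placeholder for the real work, for example a Get-FileHash call.
    Start-Sleep -Milliseconds 50
    $Done++

    # The total is known here, so a percentage is meaningful.
    $Percent = [int](($Done / $Candidates.Count) * 100)
    $Progress = @{
        Activity        = 'Calculating file hashes'
        Status          = "$Done of $($Candidates.Count)"
        PercentComplete = $Percent
    }
    Write-Progress @Progress
}

# Clear the bar once the loop finishes.
Write-Progress -Activity 'Calculating file hashes' -Completed
```

Counting files treats a 2 KB text file and a 4 GB ISO as equal steps, so the bar would still jump unevenly. Tracking bytes hashed against total candidate bytes might give a smoother result.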

The progress bar goes on the list of things I still need to learn, right next to performance tuning, better file enumeration, and not starting “small” scripts that somehow turn into unpaid software projects.

For now, the script works.

It logs duplicates.

It does not delete anything.

And the progress bar can continue being my tiny unfinished villain.

Basic scan status. Not a real progress bar yet. More like the script waving from across the room saying it is still alive.

Why System Folders Are Skipped by Default

On Ubuntu, some folders are not normal places where personal documents live.

Examples:

/proc
/sys
/dev
/run
/tmp
/var/tmp
/snap
/lost+found

The script skips common system folders by default.
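The post does not show how the skip is implemented. One possible shape, sketched here with a hypothetical helper named Test-IsExcludedPath, is a prefix check against the folder list above.

```powershell
# Hypothetical helper; the name and logic are illustration only, not the original script.
$ExcludedRoots = @('/proc', '/sys', '/dev', '/run', '/tmp', '/var/tmp', '/snap', '/lost+found')

function Test-IsExcludedPath {
    param(
        [Parameter(Mandatory)][string]$Path,
        [string[]]$Roots = $ExcludedRoots
    )
    foreach ($Root in $Roots) {
        # Match the folder itself or anything beneath it, but not look-alikes such as /devices.
        if ($Path -ceq $Root -or $Path.StartsWith("$Root/", [System.StringComparison]::Ordinal)) {
            return $true
        }
    }
    return $false
}

Test-IsExcludedPath -Path '/proc/cpuinfo'            # True
Test-IsExcludedPath -Path '/home/support/Documents'  # False
Test-IsExcludedPath -Path '/devices/notes.txt'       # False
```

The comparison is case-sensitive on purpose, since Linux paths are case-sensitive. Checking for the trailing slash keeps /dev from swallowing an unrelated folder like /devices.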

Can I scan everything?

Yes.

pwsh ./Find-DuplicateFiles.ps1 -ScanPath "/" -IncludeSystemFolders

Would I normally do that?

No.

That is how you get permission errors, strange pseudo-files, and a long wait for very little benefit.

Most of the time, I am looking for duplicate user files. I am not trying to inspect every pipe and crawlspace in the operating system.

What This Script Will Find

This script is good for exact duplicates:

same PDF copied twice
same photo saved in two folders
same video file duplicated
same document with a different name
old backups containing identical files

It will not find “close enough” matches.

It will not catch:

edited photos
cropped images
same video exported at another resolution
Word files with small changes
PDFs generated from the same source but saved differently

Those files may look similar, but their contents are different. Different content means different hash.

This script is not doing fuzzy matching.

It is not doing image comparison.

It is not guessing.

That is fine. Exact duplicate detection is already useful enough for version one.

Why This Could Be Useful at Work, Eventually

Now let me climb onto the sysadmin high horse for a minute, but not too high. I still need to get down without embarrassing myself.

This is not just a “my laptop is messy” script. The idea behind it could be useful at work too, especially in places where file shares have been collecting duplicate junk since the era of beige monitors and printer jams that sounded like farm equipment.

You know the folders.

Shared
Shared Old
Shared New
Shared Final
Shared Final 2
Archive
Archive Old
Do Not Delete
Actually Do Not Delete

A beautiful little museum of panic, indecision, and people naming files while under stress.

A duplicate file report could help with things like:

file server cleanup
shared drive review
department folder cleanup
migration planning
backup size reduction
archive review
storage growth analysis
preparing for cloud migration
finding duplicate exports and reports

That said, I would not run this on a production server yet.

Not yet.

At least not until I figure out how to make it faster or, in more pretentious words, “optimized for efficiency.”

There. Now it sounds like it belongs in a project charter.

Right now, this script works, but it is slow. That is fine on my laptop. My laptop can complain. It has no board meeting to attend. A production file server is a different animal. People are using it. Backups may be running. Antivirus may be scanning. Users may be opening files. The last thing I need is my little duplicate-file science project chewing through disk I/O while someone is trying to access their department folder.

That is how a useful script becomes a ticket generator.

So for production, I would want to improve a few things first:

better filtering before hashing
smarter folder exclusions
faster file enumeration
optional file type filtering
hash caching
incremental scans
scheduled off-hours runs
clear logging
testing on a non-production copy first

That last one matters.

Production is not where you “see what happens.”

Production is where you bring something that already behaved itself somewhere else.

Still, the idea is useful. A clean duplicate report can help before doing storage cleanup, archive review, or migration planning. It gives you something you can hand to the file owner and say:

Here are the duplicate groups.
Here are the paths.
Here are the sizes.
Here are the hashes.
Please tell me what stays.

That is the important part.

As a sysadmin, I do not want to randomly delete department files because a script says two files match. That is how you become the villain in a meeting.

The useful part is that the script separates detection from decision-making.

The script finds the exact duplicates.

The business owner decides what can go.

That is how it should work at work, especially in production.

Also, the TSV format helps. It is not fancy, but it opens in Excel or LibreOffice Calc. People understand spreadsheets. Managers understand spreadsheets. Auditors understand spreadsheets. Nobody wants to read raw terminal output unless they have made very poor life choices.

With the duplicate group column, the review is cleaner. Every file in the same group has the same size and hash. That means it is an exact content match. Not a guess. Not a filename match. Not “looks close enough.” Exact.

So the better production conversation is not:

I think these are duplicates.

It is:

These files have the same size and same SHA256 hash. They appear to be exact duplicates. Please review before deletion.

Boring sentence.

Good.

Boring is what I want in production.

Exciting production changes usually come with email threads, root cause analysis, and someone saying “just checking in” when they are absolutely not just checking in.

So yes, this script could be useful at work.

Eventually.

After testing.

After tuning.

After making it less slow.

After I learn how to say “optimized for efficiency” without making myself roll my eyes.

For now, it is a good laptop script and a good learning project.

That is enough.

The Important Safety Part

The script does not delete anything.

That is deliberate.

Finding duplicates is one job.

Deleting files is another job.

I want the first job automated. I want the second job reviewed by a human who has had coffee and is not rushing.

The report gives me the evidence.

I still make the decision.

Is that slower?

Yes.

Is it safer?

Also yes.

Closing Thought

So that is it.

I wrote a duplicate file scanner in PowerShell on Ubuntu.

It is slow.

It is not optimized.

The progress bar is still being a little jerk.

But it works.

For now, that is enough.

References

Dent, C. (2021). Mastering PowerShell scripting: Automate and manage your environment using PowerShell 7.1 (4th ed.). Packt Publishing.

Microsoft. (n.d.-a). Get-FileHash (Microsoft.PowerShell.Utility). Microsoft Learn. Retrieved June 14, 2026, from https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.utility/get-filehash

Microsoft. (n.d.-b). FileSystemInfo.LastWriteTime property. Microsoft Learn. Retrieved June 14, 2026, from https://learn.microsoft.com/en-us/dotnet/api/system.io.filesysteminfo.lastwritetime

Microsoft. (2026, March 31). Install PowerShell 7 on Ubuntu. Microsoft Learn. https://learn.microsoft.com/en-us/powershell/scripting/install/install-ubuntu

National Institute of Standards and Technology. (2015). Secure Hash Standard (SHS) (FIPS PUB 180-4). U.S. Department of Commerce. https://csrc.nist.gov/pubs/fips/180-4/upd1/final

Plunk, T., Petty, J., & Leonhardt, T. (2022). Learn PowerShell in a month of lunches: Covers Windows, Linux, and macOS (4th ed.). Manning.
