
OpenCV Python: Image Duplicate Detection Via OpenCV’s Hashing Module & Hamming Distance (Part 1)


In this post, I will explore a method to detect duplicate images in a set of images. The method relies on OpenCV’s Hashing capabilities. The idea is to generate a Hash String for each image, keep track of these strings in a dictionary and mark images that produce an already-seen string as duplicates.

Aside from the Hashing method itself, the main problem that arises is the comparison between hash strings. If two images produce the same string, they should be the same (or a very similar) image. However, looking for one string in a plain list takes O(n) time, because each stored element may have to be compared in turn.

A dictionary comes as a natural solution for this, as we can determine in O(1) average time whether a key already exists inside this data structure. String storing can be visualized in the following picture:

Each entry in this dictionary represents one original image. The Hash String serves as the key. The value is a list that holds the original image followed by every duplicate that produced the same Hash String. We can include an additional flag to mark the original and duplicated images, which will be useful if we need to explicitly loop through the duplicates. The final Hash dictionary will look like this:
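
In code, that layout can be sketched roughly as follows. This is only an illustration: the short keys and the image placeholders are made up (real keys are 64-character hex strings and real values hold image arrays), but each entry has the same [image, unique flag] shape used by the code in this post:

```python
# Illustrative layout only: the keys and image placeholders are made up.
# Each key maps to a list of [image, uniqueFlag] pairs; the first pair
# is the original image, any following pairs are its duplicates.
hashTable = {
    "ffff7ff8": [["<image 0>", True], ["<image 1>", False]],
    "ffffefe7": [["<image 2>", True]],
}

# Membership tests on a dictionary take O(1) time on average:
print("ffff7ff8" in hashTable)  # True
print("00000000" in hashTable)  # False

# Number of duplicates stored under a key (everything after the original):
print(len(hashTable["ffff7ff8"]) - 1)  # 1
```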

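We will be using OpenCV’s Hashing module (img_hash, which is part of the opencv-contrib modules). The hashing module computes a compact array of values as a function of an input image, so that identical images always yield identical arrays. This module offers multiple hashing methods; we will use the BlockMeanHash method.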

We can downsample the image before hashing. This offers a certain level of robustness against small image variations. However, if pixel content varies too drastically (e.g., because of noise, level or brightness variations, etc.), similar images will yield different keys. To handle this, we can compare strings using a distance-based similarity metric. One option is the Hamming Distance, which counts the positions at which two strings of equal length differ. This metric will tell us by how many characters two strings differ.
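
As a rough preview of that idea, a character-level Hamming Distance between two equal-length hex-strings could be sketched as below. The function name hammingDistance and the short sample strings are my own placeholders for illustration, and this is not necessarily the implementation used in the second part:

```python
# Hypothetical helper for illustration, not the implementation from part two:
def hammingDistance(firstString, secondString):
    # The Hamming distance is only defined for strings of equal length:
    if len(firstString) != len(secondString):
        raise ValueError("Strings must have the same length")
    # Count the positions at which the characters differ:
    return sum(a != b for a, b in zip(firstString, secondString))

print(hammingDistance("ffff7ff8", "ffff7ff8"))  # 0
print(hammingDistance("ffff7ff8", "fffe7ff0"))  # 2
```

Note that this counts differing hex characters, not differing bits; two characters that differ in a single bit and two that differ in all four bits both add 1 to the distance.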

In the first part we will cover the basic hashing method. In the second part, we will include the Hamming Distance implementation to detect similar Hash Strings produced by similar images. Let’s check the code. We will need the following helper function, showImage, which displays an image in a new window using OpenCV’s High GUI:

import cv2

# Defines a re-sizable image window:
def showImage(imageName, inputImage):
    cv2.namedWindow(imageName, cv2.WINDOW_NORMAL)
    cv2.imshow(imageName, inputImage)
    cv2.waitKey(0)

Also, we will be using the following two test images:

Let’s create a list in which all the images will be stored. I’ll add the first image two times, as the objective of this script is to detect (and, eventually, delete) image duplicates:

# The images are stored here:
imageList = []

# Check out the images (imagePath1 and imagePath2 hold the paths
# to the two test images):
for currentPath in [imagePath1, imagePath1, imagePath2]:
    # Read images:
    currentImage = cv2.imread(currentPath)
    # Store images:
    imageList.append(currentImage)
    showImage("Current Image", currentImage)

Alright. Let’s go over the hashing method. We will rescale the image to 25% of the original scale, convert the image to grayscale and feed it to the hashing object. As previously stated, we will use a dictionary to store the Hash Strings along with their images. Additionally, I’ll use a simple list, keyList, to keep track of the unique Hash Strings (excluding duplicates) generated. This will come in especially handy for measuring the distance between strings in the second part of this post.

For now, let’s set up these variables and instantiate the OpenCV Hashing object:

# Set the "hashing scale":
imageScale = (0.25, 0.25)

# Prepare the dictionary of duplicated hashes:
hashTable = {}

# Prepare the image hashing object:
hasher = cv2.img_hash.BlockMeanHash_create()

# Store unique images/keys here:
keyList = []

The hash is obtained with a call to the Hashing object, which returns it as a numpy array of 32 8-bit integers. We will convert this numerical array into a string made out of hexadecimal digits. This will make storing and comparing the hashes easier.

We will define a getHashString function that performs the conversion from numpy array to hex-string. The hash array returned by the hashing object comes nested inside an outer array, a quirk of OpenCV’s Python bindings and their handling of the underlying cv::Mat C++ object. The function is straightforward: it loops through the inner array, extracts each integer and converts it to two hexadecimal digits (nibbles).

It is important to keep a fixed width of two nibbles per integer, as this can affect our distance measure later on when comparing two hex-strings. If a value converts to a single hex digit, we will pad it with an extra zero in the most significant position. We will also drop the hexadecimal base prefix (0x) after each conversion:

# Converts the hash (an array of 32 8-bit integers) into a hex-string:
def getHashString(inputArray):
    # Store the hash string here:
    outString = ""
    for i in range(inputArray.shape[0]):
        # Get int from array, convert to hex and then to string:
        hexChar = str(hex(inputArray[i]))
        # Discard the "0x" prefix:
        hexChar = hexChar[2:]
        # Each int is encoded using two nibbles:
        if len(hexChar) == 1:
            # Pad the most significant nibble with "0"
            hexChar = "0" + hexChar

        # Concatenate:
        outString = outString + str(hexChar)[-2:]
    # Done:
    return outString
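
For reference, the same conversion can be written more compactly with Python’s format specifiers. This is only a sketch: hashToHex is a hypothetical name, and the sample values are made up. It accepts any sequence of integers between 0 and 255, which should include the inner numpy array since each value is cast with int():

```python
# Hypothetical compact equivalent of getHashString:
def hashToHex(values):
    # "02x" formats each integer as two lowercase hex digits,
    # zero-padded and without the "0x" prefix:
    return "".join(format(int(value), "02x") for value in values)

print(hashToHex([255, 127, 3, 0]))  # ff7f0300
```

The zero-padding keeps every integer at exactly two characters, so the resulting strings always have the same length and can be compared position by position.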

Now, let’s process the images. The following snippet loops through every image stored in the imageList variable. Every image is converted to grayscale and resized before hashing. The Hash String is used as the key and the image is stored as a value in the Hash dictionary. We expect that similar images will produce the same key. We will use a list for every key and append all the duplicates as they are encountered. An additional flag will be used to distinguish the original image (first entry) from its duplicates:

# Process the images:
for i, currentImage in enumerate(imageList):

    # BGR to Gray:
    grayImage = cv2.cvtColor(currentImage, cv2.COLOR_BGR2GRAY)

    # Scale down the image by the scale factor:
    grayImage = cv2.resize(grayImage, None, fx=imageScale[0], fy=imageScale[1])

    # Compute the hash of the downsampled grayscale image:
    imageHash = hasher.compute(grayImage)

    # Convert the array of 32 8-bit integers into a hex-string:
    hashString = getHashString(imageHash[0])

    print("Image:", i, "Hash:", hashString)
    showImage("Gray Image", grayImage)

    # Into the hash table. Every entry is a list item:
    if hashString not in hashTable:
        # First entry, unique flag is True:
        hashTable[hashString] = [[currentImage, True]]
        # Unique keys into the list:
        keyList.append(hashString)
    else:
        # Subsequent entries, append duplicates. 
        # Unique flag is False:
        hashTable[hashString].append([currentImage, False])

Resizing the images helps ignore small differences between similar images, so that they produce the same Hash String. This is the hashing result, along with the images that produced each string:

Image 0 and Image 1 produce the same hex-string. Neat. Let’s check out the number of unique entries:

# Get number of unique images:
uniqueImages = len(keyList)
print("Unique images: ", uniqueImages)
>> Unique images:  2

Now, let’s keep only the unique entries in the Hash dictionary. We will check the unique flag to tell original entries from duplicates and delete the latter. This bit loops through each key and retrieves its list of entries. We will then loop through that list and delete each duplicate manually. We iterate through the list in reverse, so that removing an element does not shift the indices of the elements still to be visited:

# Check items on the hash table:
for key in hashTable:

    # Get dict entry:
    currentDuplicates = hashTable[key]
    # Get key duplicates:
    totalDuplicates = len(currentDuplicates)

    # Print how many duplicates are found for this particular image (key):
    print("Examining duplicates for: ", key, "Duplicates found: ", totalDuplicates)

    # Check out duplicates:
    for i in range(totalDuplicates - 1, 0, -1):

        print(" Duplicate: ", i)
        # Get the current entry:
        currentList = currentDuplicates[i]
        # Get the unique flag (False marks a duplicate):
        uniqueFlag = currentList[1]

        # Remove duplicate:
        if not uniqueFlag:
            hashTable[key].pop(i)
            print(" - Removed element: ", i)
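
As an alternative to popping elements one by one, the duplicates could also be dropped by rebuilding each list with only the flagged originals. This is a sketch on a made-up table (the short keys and image placeholders are not real data), not the approach used in the notebook:

```python
# Made-up table with the same [image, uniqueFlag] layout as the post's hashTable:
hashTable = {
    "ffff7ff8": [["<image 0>", True], ["<image 1>", False]],
    "ffffefe7": [["<image 2>", True]],
}

for key in hashTable:
    # Keep only the entries whose unique flag is True:
    hashTable[key] = [entry for entry in hashTable[key] if entry[1]]

print({key: len(entries) for key, entries in hashTable.items()})
# {'ffff7ff8': 1, 'ffffefe7': 1}
```

Reassigning the value of an existing key while iterating is safe here because the set of keys itself does not change.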

The output in the console reads:

Examining duplicates for:  ffff7ff83ff83ff83ff83ff83ff83ff813f807f003e003c003c003c007800780 Duplicates found:  2
 Duplicate:  1
 - Removed element:  1
Examining duplicates for:  ffffefe74fe10fe00ff01ff80fe00fe00ff01ff41ff20ff00fe00fe000000000 Duplicates found:  1

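A duplicate was found for the first key and promptly removed. Note that the "Duplicates found" count is the length of the key’s list, so it includes the original entry: a count of 2 means one original plus one duplicate. Finally, let’s check out the final images stored in the Hash dictionary: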

# Show original images:
for key in hashTable:

    # Get image count (should be 1 -> the original entry):
    imageCount = len(hashTable[key])
    print("Key: ", key, " images: ", imageCount)

    # Get actual image
    for i in range(imageCount):
        currentImage = hashTable[key][i][0]
        showImage("Final Image", currentImage)

This shows the unique images stored in the dictionary, along with the following output:

Nice, it looks like the duplicated image was indeed removed. However, if two similar images differ too much in pixel-level content, the hashing will produce two different strings. In the next post we will explore a method to compare two Hash Strings in order to determine whether they represent the same image.

A notebook containing the full example code is available here:

https://github.com/gone-still/ai/tree/main/computerVision/imageHashing

