Every programmer has run into a situation where a long bug hunt turns out to be caused by something tiny. I hit such an issue in yesterday's work; it held me up for most of the day, and when I finally found the cause I was speechless.
First, a little background: I do algorithm model training, and the job at hand was to iterate on an algorithm, adding the latest dataset to train a more accurate model.
The dataset came as labeled XML, so the first step was format conversion and a train/test split. The code I used is as follows:
import os
import shutil
import xml.etree.ElementTree as ET


def list_dir(path: str):
    """List all files in the directory"""
    for item in os.listdir(path):
        yield item


def parse_xml():
    base_path = "/home/lijinkui/Desktop/head_shoudler_20240824/Head and Shoulders Detection_V1.13_20240823083458_V1"
    label_dir = f"{base_path}/gt"
    count = 0
    test_obj = open(f"{base_path}/test/test_ssd.txt", "a")
    # train_obj = open(f"{base_path}/train/", "a")
    for item in list_dir(label_dir):
        xml_path = f"{label_dir}/{item}"
        # print(xml_path)
        # Parse the XML file and get the root element
        root = ET.parse(xml_path).getroot()
        filename = root.find('filename').text
        width = int(root.find('size').find('width').text)
        height = int(root.find('size').find('height').text)
        # print(width, height)
        count += 1
        print(count)
        box_list = []
        for index, label in enumerate(root.findall('object')):
            category = label.find('name').text
            bbox = label.find('bndbox')
            x1 = bbox.find('xmin').text
            y1 = bbox.find('ymin').text
            x2 = bbox.find('xmax').text
            y2 = bbox.find('ymax').text
            box_list.extend([x1, y1, x2, y2, "1"])
        txt_string = " ".join(box_list) + "\n"
        if count <= 1200:
            # test_obj.write(txt_string)
            # shutil.copy(f"{base_path}/images/{filename}", f"{base_path}/test/images/{filename}")
            test_obj.write(f"# 20240824/{filename} \n")
            test_obj.write(txt_string)
        # else:
        #     train_obj.write(txt_string)
        #     shutil.copy(f"{base_path}/images/{filename}", f"{base_path}/train/images/{filename}")
    test_obj.close()
    # train_obj.close()


if __name__ == '__main__':
    parse_xml()
Save the first 1200 images as the test set:
        if count <= 1200:
            # test_obj.write(txt_string)
            # shutil.copy(f"{base_path}/images/{filename}", f"{base_path}/test/images/{filename}")
            test_obj.write(f"# 20240824/{filename} \n")
            test_obj.write(txt_string)
The saved format is:
# meili/
1796 550 1861 618 1
# meili/
1674 515 1749 585 1
# meili/
1527 473 1609 545 1
# meili/
1373 457 1455 531 1
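As an aside, here is a minimal sketch of how the training side might read this format back into memory. The parsing logic is my assumption, not the actual loader code: a line starting with # names an image, and the line after it holds the boxes as repeated x1 y1 x2 y2 class groups.
def load_labels(txt_path):
    """Hypothetical reader for the label format above (not the actual training code)."""
    samples = {}
    current = None
    with open(txt_path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):
                # image path, e.g. "meili/..."; only the newline was removed above
                current = line[1:].lstrip()
                samples[current] = []
            elif line and current is not None:
                nums = line.split()
                # boxes are stored flat: x1 y1 x2 y2 cls, repeated per object
                for i in range(0, len(nums), 5):
                    x1, y1, x2, y2, cls = map(int, nums[i:i + 5])
                    samples[current].append((cls, x1, y1, x2, y2))
    return samples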
Then I configured the parameters and happily kicked off training. It didn't run for long before it errored out. After some troubleshooting, I traced the error to the following code:
def converter(args):
    im_file, image_name, labels = args
    try:
        # Reading the image info this way loads faster, but misses some corrupted images
        # im = Image.open(im_file)
        # im.verify()  # PIL verify
        # img_w, img_h = exif_size(im)  # (width, height)
        # Reading with opencv can catch corrupted images in the dataset without affecting training
        im = cv2.imread(im_file)
        img_h, img_w = im.shape[:2]
        tmp = []
        for l, x1, y1, x2, y2 in labels:
            x, y, w, h = (x1 + x2) // 2, (y1 + y2) // 2, x2 - x1, y2 - y1
            x, y, w, h = x / img_w, y / img_h, w / img_w, h / img_h
            tmp.append([l, x, y, w, h])
        return image_name, tmp
    except Exception as e:
        print("-------------------------")
        print(im_file)
        print(f"{im_file} has broken... : {e}")
        return None, None
The error message reported was: 20240824/ has broken... : 'NoneType' object has no attribute 'shape'.
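This message comes from the except branch above, and it means im was None: cv2.imread does not raise when it cannot load a file, it just returns None, and the crash only surfaces at the next attribute access. A minimal demonstration, using a hypothetical file name:
import cv2

im = cv2.imread("no_such_file.jpg")   # hypothetical nonexistent file
print(im)                             # None -- imread fails silently
print(im.shape)                       # AttributeError: 'NoneType' object has no attribute 'shape'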
My first thought was that the image might be corrupted, and the easiest way to verify that is to load and display it with the OpenCV library. So I opened a Python terminal and inspected the image:
>>> import cv2 as cv
>>> image = cv.imread("/h3c_data/data/recognize_new_data/project_dataset/HeadShoulder/Test/Image/20240824/")
>>> image.shape
(1080, 1920, 3)
>>>
So much for that check: the image loaded fine and clearly wasn't damaged.
Then I guessed: could a permissions or user-group issue be making the image unreadable? So I checked the permissions on the file.
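One way to run that check from Python is os.access, which tests the effective permissions of the current user (a sketch with a hypothetical path; ls -l from the shell works just as well):
import os

path = "20240824/sample.jpg"        # hypothetical path, for illustration
print(os.access(path, os.R_OK))     # is the file readable by the current user?
print(os.access(path, os.W_OK))     # is it writable?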
Read and write permissions and the user group were all fine. Strange. What was the problem?
At that point I reached for the breakpoint trick: set a breakpoint just before the image is read, then step through line by line to see what goes wrong. But the breakpoint could not be hit. The function runs inside a thread pool, eight workers executing concurrently, and whenever execution reached the breakpoint, the worker simply exited:
with Pool(NUM_THREADS) as pool:
    pbar = tqdm(pool.imap_unordered(converter, zip(image_pathes, image_names, labels)),
                desc=desc, total=len(image_pathes))
    for image_name, tmp in pbar:
        if image_name:
            dst_ret.write("%s" % image_name)
            for l, *info in tmp:
                line = (l, *info)
                dst_ret.write((" %d" + " %g" * len(info)) % line)
            dst_ret.write("\n")
With breakpoints ruled out, I was nearly out of tricks, so I fell back on print debugging to check whether something was wrong with the file paths. Since I couldn't inspect each path one by one, I simply printed all of them at once: I dumped the variable holding the list of image paths, and that's when I found the problem.
It turned out that every image path had an extra space at the end. Suddenly it all made sense: no wonder I hadn't noticed it in the Python terminal. A plain print doesn't make trailing whitespace visible, and when I copied the image path out of the terminal I never realized there was a space after it.
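This is exactly where repr() earns its keep, since it makes trailing whitespace visible. A quick illustration with a made-up path:
>>> path = "20240824/sample.jpg "    # hypothetical entry from the path list
>>> print(path)
20240824/sample.jpg
>>> print(repr(path))
'20240824/sample.jpg '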
Going back to find where the space came from: it was an extra space casually typed while the dataset was being processed, the one sitting before the \n in test_obj.write(f"# 20240824/{filename} \n").
The problem itself is not hard to find, but two things hid it: first, the terminal doesn't display a trailing space, so I never saw it; second, I wanted to debug with breakpoints, but breakpoints don't take effect when the code runs in the thread pool, so I couldn't step through and inspect the variables.
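The immediate fix was to delete the stray space from the conversion script, but a defensive strip() when reading paths back would make the pipeline immune to this whole class of typo. Against the hypothetical reader sketched earlier:
# strip() trims whitespace on both ends, so a stray trailing space in the
# label file can never leak into an image path.
if line.startswith("#"):
    current = line[1:].strip()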
It took me the better part of a day of back and forth before I finally pinned it down. Such is a programmer's daily routine: fighting with all kinds of bugs.