random access memories

Decoding BarbarousKing's Battle with Bowser in Kaizo Mario World 3

Introduction

People do stupid things all the time. BarbarousKing took it upon himself to defeat the final boss of Kaizo Mario World 3 hit-less, meaning he would not take the mushroom power-ups before going into the fight. Something that BarbarousKing thought would take a few hours ended up taking him just shy of 27 hours.

Kaizo Mario is a series of fan-made, custom levels for the Super Mario World game on the Super Nintendo, known for their extreme difficulty. These levels are intentionally designed to be incredibly challenging, often featuring intricate and precise jumps, hidden traps, and obstacles that require near-perfect timing and expert control to navigate.

After watching some of the attempts, I started to wonder how many attempts it would take him. He wasn't using a death counter and chat wasn't tracking it for him. I figured it would probably be possible to cook something up in Python to count the attempts in an automated way, and after a little research and finding the OpenCV library I thought I'd give it a try. Let's explore the tools and the thought process behind how I determined roughly how many attempts were made.

Downloading VODs

Before I began, I needed to get the videos from Twitch. I knew about a tool for downloading YouTube videos, yt-dlp, and luckily for me it also works for Twitch VODs (Videos On Demand). I found seven VODs of Barb attempting Bowser, grabbed the links to those, and placed them in a file:

https://www.twitch.tv/videos/1966545586
https://www.twitch.tv/videos/1967415642
https://www.twitch.tv/videos/1969476402
https://www.twitch.tv/videos/1970200938
https://www.twitch.tv/videos/1970651620
https://www.twitch.tv/videos/1971009244
https://www.twitch.tv/videos/1971858038

Then I used yt-dlp to download the VODs:

yt-dlp -f 'best[height<=160]' -o '%(autonumber)s-%(upload_date)s-%(id)s-%(height)s.%(ext)s' -a videos.txt

I decided to go with the smallest video size I could, 160p, to reduce the size on disk and the time it would take me to download the VODs. [1]

Finding the Key

My first thought on determining an attempt was to just count the number of times I saw "MARIO START !"

There were two issues with this. First, there is a bunch of data we didn't need; the left portion of the screen is just Barb and chat. Secondly, "MARIO START !" appears each time Barb enters the level. He had to get to Bowser first and he died a lot getting to the checkpoint before Bowser.

My next thought was to key on the room before Bowser.

This is a good place to key off. I figured I couldn't just key off the room itself, so I decided to key off just the portrait of Bowser. But how do you crop the video to just the portrait? ffmpeg was the answer, but I needed some information first: the horizontal distance from the left of the frame to the portrait, the vertical distance from the top of the frame to the portrait, and the dimensions of the portrait.

Using Preview on macOS I was able to determine all the pieces of information to use the crop filter that ffmpeg has built in.

  1. The dimensions of the video are 284x160 px.
  2. The distance from the left of the frame to the left side of the portrait is 162 px.
  3. The distance from the top of the frame to the top of the portrait is 58 px.
  4. The dimensions of the portrait are 48x51 px.

With all this information in hand, I went ahead and cropped the video.

for file in *.mp4; do ffmpeg -i "$file" -filter:v "crop=51:48:162:58" -an "${file%.*}-cropped.mp4"; done
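ffmpeg's crop filter takes its arguments in the order width:height:x:y, so it is easy to swap two of the numbers by accident. As a rough sketch, the filter string could be built from the measurements with a small helper that also checks the rectangle fits inside the frame. The function name `crop_filter` and its parameters are made up for illustration and are not part of the original script. The values below follow the command above, which passes 51 as the width and 48 as the height.

```python
def crop_filter(frame_w, frame_h, x, y, w, h):
    """Build an ffmpeg crop filter string (crop=w:h:x:y) after a bounds check."""
    if x < 0 or y < 0 or w <= 0 or h <= 0:
        raise ValueError("offsets must be non-negative and size must be positive")
    if x + w > frame_w or y + h > frame_h:
        raise ValueError("crop rectangle extends past the frame")
    return f"crop={w}:{h}:{x}:{y}"


# Frame size and offsets from the measurements above (hypothetical helper).
print(crop_filter(284, 160, x=162, y=58, w=51, h=48))
```

The resulting string is what would be passed to `-filter:v`.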

The cropped video gives an image like this:

The Code

Below is a portion of the code I wrote to count how many attempts it took Barb to kill Bowser. The full code, including debug code, is available on GitHub.

The first thing to do was to figure out how to determine if two images are similar. That was easy enough using scikit-image:

from skimage.metrics import structural_similarity as ssim

def compare_images(imageA, imageB):
    # Compute SSIM between two images
    return ssim(imageA, imageB)
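For intuition about what that score means, here is a simplified, single-window version of the SSIM formula in plain Python. This is only a sketch: scikit-image computes SSIM over local sliding windows and averages the result, so its numbers will differ from this global version. The function name `global_ssim` and the sample pixel values are made up for illustration.

```python
def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM over two equal-length sequences of pixel values."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((p - mean_a) ** 2 for p in a) / n
    var_b = sum((q - mean_b) ** 2 for q in b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n
    # Stabilizing constants with the usual K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mean_a * mean_b + c1) * (2 * cov + c2)) / (
        (mean_a ** 2 + mean_b ** 2 + c1) * (var_a + var_b + c2)
    )


pixels = [10, 200, 30, 180, 50, 160]      # made-up grayscale values
inverted = [255 - p for p in pixels]      # same structure, flipped contrast

print(global_ssim(pixels, pixels))        # identical inputs score (about) 1
print(global_ssim(pixels, inverted))      # inverted input scores much lower
```

A score near 1 means the two images have nearly the same brightness, contrast, and structure, which is why a threshold like 0.8 can work as a "this is probably the portrait" test.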

Next was to actually read the video. From what I was able to read online, I didn't need full-color video. I didn't strip the color out when I cropped the video, so I just used OpenCV to convert each frame to grayscale.

In the main method I had to iterate over the frames until there were no more frames and compare each frame to the portrait screenshot. I am not going to go over each line, as the code is commented. But I wanted to say how easy OpenCV was to work with.

import cv2

def main(video_path, screenshot_path):
    # Read the screenshot
    screenshot = cv2.imread(screenshot_path)
    screenshot = cv2.cvtColor(screenshot, cv2.COLOR_BGR2GRAY)

    # Open the video file
    cap = cv2.VideoCapture(video_path)

    if not cap.isOpened():
        print("Error: Could not open video.")
        return

    frame_count = 0
    similar_frames = []
    match_found = False

    # Loop over the frames of the video
    while True:
        ret, frame = cap.read()

        # If the frame isn't read correctly, break out.
        if not ret:
            print("Can't read frame (stream end?). Exiting ...")
            break

        frame_count += 1

        # Convert the frame to grayscale
        gray_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Compare the current frame to the screenshot
        similarity = compare_images(screenshot, gray_frame)

        # If the similarity is high enough, record the frame number and percent similar
        # Set the match_found to true, reset that once our similarity drops
        if similarity > 0.8:
            if not match_found:
                match_found = True
                similar_frames.append((frame_count, similarity))
        else:
            match_found = False
            
    # When everything done, release the capture
    cap.release()

    if similar_frames:
        # Report every match, then the total and the similarity range
        for number, score in similar_frames:
            print(f"Similar frames found at: {number}, {score} similar.")
        scores = [score for _, score in similar_frames]
        print(f'There were {len(similar_frames)} attempts.')
        print(f'low: {min(scores)}, high: {max(scores)}')
    else:
        print("No similar frames found.")
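The counting logic in the loop only records a match on the first frame where the similarity rises above the threshold, so a run of consecutive matching frames counts as one attempt. Here is a minimal sketch of that logic in isolation, run on a made-up list of scores. The function name `count_rising_edges` and the sample values are illustrative and not taken from the original script.

```python
def count_rising_edges(scores, threshold=0.8):
    """Return (frame_number, score) for each frame where the score first exceeds threshold."""
    hits = []
    above = False
    for frame_number, score in enumerate(scores, start=1):
        if score > threshold:
            if not above:
                # First frame of a new run above the threshold
                above = True
                hits.append((frame_number, score))
        else:
            # Score dropped, so the next frame above threshold starts a new run
            above = False
    return hits


sample = [0.10, 0.85, 0.90, 0.20, 0.81, 0.79, 0.82]  # hypothetical similarity scores
for frame_number, score in count_rising_edges(sample):
    print(frame_number, score)
```

In this sample the run 0.81, 0.79, 0.82 is counted twice because the score dips just under the threshold for one frame. A score hovering around 0.8 is one way the count can drift upward.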

In the output of the program I also wanted to see the lowest and highest similarity among the matched frames.

...
Similar frames found at: 3423922, 0.8049449456081211 similar.
Similar frames found at: 3425877, 0.8174777870183221 similar.
Similar frames found at: 3425936, 0.8064994150709818 similar.
Similar frames found at: 3427873, 0.8126211193541708 similar.
Similar frames found at: 3427933, 0.8089345290975335 similar.
There were 2299 attempts.
low: 0.8000030858825345, high: 0.9803608503421268
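To spot-check a reported match against the VOD, a frame number can be converted to a timestamp. The sketch below assumes a constant frame rate, which is a guess on my part. In the script, the rate could be read from the capture with `cap.get(cv2.CAP_PROP_FPS)`. The function name `frame_to_timestamp` is made up for illustration.

```python
def frame_to_timestamp(frame_number, fps):
    """Convert a frame number to an HH:MM:SS string, assuming a constant frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    total_seconds = int(frame_number / fps)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


# 3600 frames at an assumed 60 frames per second is one minute in.
print(frame_to_timestamp(3600, 60.0))
```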

Conclusion

There you have it: it took Barb about 2,299 attempts in nearly 27 hours to kill Bowser in Kaizo Mario World 3. I think this number is accurate to within about +/- 300 attempts. I had a great time learning more about tools I have used in the past, but not in depth. A big shout-out to the yt-dlp, ffmpeg, OpenCV, and Python maintainers. It is a lot of work and your tools rock.

In my next iteration I want to make some changes.

  1. Use higher quality videos
  2. Keep track of how long each actual attempt takes
  3. Use the boss room to key off of, instead of the portrait.

Finally, here is Barb in his moment of glory.

  1. In retrospect, I think this might have been a mistake for accurately determining the count. I think using a better resolution video would make the matching work better. As I write this post I am downloading the 720p VODs to test my theory.