
Android web screen-casting control: from getting started to giving up


Background

The business needed to capture the entire process of performing a task in an app. The original solution was fairly complex, and changes required coordinating with multiple parties, so we considered whether there was a more lightweight option.

The requirements:

  • Record every step of completing a task (clicks, swipes, text input, etc.)
  • Record a screenshot and the layout XML before and after each operation

The adb-based approach

The easiest option to consider is adb: it can fetch the current page's XML and a screenshot of the current page, so each operation just needs to be forwarded to the phone through adb.

Approach

  1. Connect to the device via adb; write an agent program that receives operation requests from the web page and executes them by sending adb commands.
  2. Fetch the current page XML via adb (uiautomator dump); a sketch follows below.
  3. Fetch a screenshot of the current page via adb (screencap); the agent sends it to the web page over WebSocket.
  4. The web page displays the image, listens for mouse click events, and calculates the click position.
  5. Send the corresponding operation to the device via adb to simulate it.
  6. Repeat steps 2-5.
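
Step 2 is not shown in the snippets below; a minimal sketch of it, assuming the same adb-over-os/exec pattern as the other helpers, might look like this:

// dumpLayout fetches the current UI hierarchy as XML.
// uiautomator writes the dump to a file on the device; we read it back with exec-out.
func dumpLayout(deviceID string) ([]byte, error) {
	dump := exec.Command("adb", "-s", deviceID, "shell",
		"uiautomator", "dump", "/sdcard/window_dump.xml")
	if err := dump.Run(); err != nil {
		return nil, err
	}
	cat := exec.Command("adb", "-s", deviceID, "exec-out", "cat", "/sdcard/window_dump.xml")
	return cat.Output()
}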

Once the process is clear, you can hand it straight to a coding LLM and have the code in seconds. Since a Go binary has few dependencies, we simply had the LLM generate Go code.

The following describes part of the implementation, such as calling adb from Go, passing in the device ID and the operation sent from the web side:


import (
	"fmt"
	"os/exec"
)

func executeCommand(deviceID string, action string, parameters string) error {
	cmdArgs := []string{"-s", deviceID, "shell", action}
	if parameters != "" {
		cmdArgs = append(cmdArgs, parameters)
	}
	fmt.Println(cmdArgs)
	cmd := exec.Command("adb", cmdArgs...)
	err := cmd.Run()
	if err != nil {
		return err
	}
	return nil
}

For example, to take a screenshot, call screencap to capture an image in PNG format:

// screenshot captures the current screen as PNG bytes (uses "bytes" in addition to "os/exec").
func screenshot(deviceID string) ([]byte, error) {
	cmd := exec.Command("adb", "-s", deviceID, "exec-out", "screencap", "-p")
	var out bytes.Buffer
	cmd.Stdout = &out
	err := cmd.Run()
	if err != nil {
		return nil, err
	}
	return out.Bytes(), nil
}
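
For step 3, the agent pushes those PNG bytes to the browser over WebSocket. A minimal sketch, assuming gorilla/websocket on the server side (the post does not say which ws library was used; import "github.com/gorilla/websocket"):

// pushScreenshot captures the screen and sends the PNG bytes to the browser
// as a single binary WebSocket frame.
func pushScreenshot(conn *websocket.Conn, deviceID string) error {
	data, err := screenshot(deviceID)
	if err != nil {
		return err
	}
	return conn.WriteMessage(websocket.BinaryMessage, data)
}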

Displayed on the JavaScript side:

ws.onmessage = (event) => {
    if (event.data instanceof Blob) {
        const url = URL.createObjectURL(event.data);
        imgElement.src = url;
    }
};

A div layer can be placed above the image to listen for mouse events and simulate the actions:

overlay.addEventListener('mousedown', (e) => {
    startX = e.offsetX;
    startY = e.offsetY;
    startTime = Date.now();
});

overlay.addEventListener('mouseup', (e) => {
    const endX = e.offsetX;
    const endY = e.offsetY;
    const elapsedTime = Date.now() - startTime; // ms

    // Map display coordinates back to device-screen coordinates
    const imgStartX = Math.round((startX / imgDisplayWidth) * imgWidth);
    const imgStartY = Math.round((startY / imgDisplayHeight) * imgHeight);
    const imgEndX = Math.round((endX / imgDisplayWidth) * imgWidth);
    const imgEndY = Math.round((endY / imgDisplayHeight) * imgHeight);

    if (Math.abs(imgStartX - imgEndX) > 5 || Math.abs(imgStartY - imgEndY) > 5) {
        sendCommand('input swipe', `${imgStartX} ${imgStartY} ${imgEndX} ${imgEndY}`);
    }
    else if (elapsedTime > 500) {
        // long press: a swipe in place with a duration (ms)
        sendCommand('input swipe', `${imgStartX} ${imgStartY} ${imgEndX} ${imgEndY} ${elapsedTime}`);
    }
    else {
        sendCommand('input tap', `${imgStartX} ${imgStartY}`);
    }
});

Results and problems

The result basically works. But there are a lot of problems:

  • screencap is slow: 600~700 ms in our emulator tests, so the display feels laggy.
  • Most of the time the page isn't being operated and the image barely changes, so repeatedly sending full frames wastes bandwidth.
  • uiautomator dump is even worse: 2~3 s.

Optimization

Transmit images differentially: after each screenshot, check whether anything changed. If nothing changed, send nothing; if something changed, send a diff image that the JavaScript side merges with the previous frame.

The diff uses the simplest possible strategy: identical pixels become fully transparent, and changed pixels keep the new image's value:

// CalculateDifference computes the difference between two RGBA images and returns a new RGBA image.
// If the two images are identical, the result is fully transparent.
// (imports "image" and "image/color")
func CalculateDifference(img1, img2 *image.RGBA) *image.RGBA {
	bounds := img1.Bounds()
	diff := image.NewRGBA(bounds)

	for y := bounds.Min.Y; y < bounds.Max.Y; y++ {
		for x := bounds.Min.X; x < bounds.Max.X; x++ {
			c1 := img1.RGBAAt(x, y)
			c2 := img2.RGBAAt(x, y)

			if c1 == c2 {
				diff.SetRGBA(x, y, color.RGBA{}) // identical: all zeros, fully transparent
			} else {
				diff.SetRGBA(x, y, c2) // changed: keep the new pixel
			}
		}
	}

	return diff
}
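
The "send nothing when nothing changed" check can sit right in front of this. A minimal sketch, where the prev/cur naming and the send callback are assumptions (imports "bytes", "image" and "image/png"):

// sendFrame transmits only what changed since the previous frame.
func sendFrame(prev, cur *image.RGBA, send func([]byte) error) error {
	// Identical frames: send nothing at all.
	if prev != nil && bytes.Equal(prev.Pix, cur.Pix) {
		return nil
	}
	var frame image.Image = cur
	if prev != nil {
		frame = CalculateDifference(prev, cur) // unchanged pixels become transparent
	}
	var buf bytes.Buffer
	if err := png.Encode(&buf, frame); err != nil {
		return err
	}
	return send(buf.Bytes()) // e.g. write a binary frame to the websocket
}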

When the JavaScript side receives the diff, it merges it with the previous image to restore the frame; the front end can use a canvas to do the merge:


function createImageFromBlob(blob) {
    return new Promise((resolve, reject) => {
        const img = new Image();
        img.onload = () => resolve(img);
        img.onerror = reject;
        img.src = URL.createObjectURL(blob);
    });
}

async function restoreImage(diffImageBlob) {
    const refImage = imgElement;
    const diffImage = await createImageFromBlob(diffImageBlob);

    canvas.width = diffImage.naturalWidth;
    canvas.height = diffImage.naturalHeight;

    // Read the reference (previous) frame's pixels
    ctx.drawImage(refImage, 0, 0);
    const refImageData = ctx.getImageData(0, 0, canvas.width, canvas.height);

    // Clear first so the diff's transparent pixels stay transparent
    ctx.clearRect(0, 0, canvas.width, canvas.height);
    ctx.drawImage(diffImage, 0, 0);
    const diffImageData = ctx.getImageData(0, 0, canvas.width, canvas.height);

    const resultImageData = ctx.createImageData(canvas.width, canvas.height);
    const refData = refImageData.data;
    const diffData = diffImageData.data;
    const resultData = resultImageData.data;

    for (let i = 0; i < diffData.length; i += 4) {
        // A non-zero alpha in the diff marks a changed pixel
        const changed = diffData[i + 3] !== 0;
        resultData[i]     = changed ? diffData[i]     : refData[i];     // R
        resultData[i + 1] = changed ? diffData[i + 1] : refData[i + 1]; // G
        resultData[i + 2] = changed ? diffData[i + 2] : refData[i + 2]; // B
        resultData[i + 3] = changed ? diffData[i + 3] : refData[i + 3]; // A
    }

    ctx.putImageData(resultImageData, 0, 0);
}

Verdict

Although the idea is feasible, adb screenshots and XML dumps are just too slow, so in the end this approach was unusable and we had to find another way.

The uiautomator2-based approach

uiautomator2 is a Python library that uses the uiautomator service on the device to fetch page information and control the device. The principle is straightforward: it uses adb to start atx-agent and a server program on the device, then connects over HTTP and WebSocket to control it.

uiautomator2 can fetch the XML in a few tens of milliseconds, and screenshots reach a higher frame rate thanks to the efficient minicap.
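
For reference, the basic uiautomator2 calls cover everything the adb version did; a minimal sketch (the device serial and coordinates are examples):

import uiautomator2 as u2

d = u2.connect("emulator-5554")    # device serial, or an IP for WiFi adb

xml = d.dump_hierarchy()           # layout XML, typically tens of milliseconds
img = d.screenshot()               # a PIL.Image, fetched via the device-side agent

d.click(540, 960)                  # tap at absolute screen coordinates
d.swipe(540, 1500, 540, 500, 0.2)  # swipe with a 0.2 s duration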

With uiautomator2, the Go code needs to be converted to Python. Luckily you can throw that at the LLM as well: convert it to Python first, then change whatever used adb to use uiautomator2. It is largely the same program, and a little stitching finishes the job.
A side note: an LLM really is a great assistant for a programmer. With a good program design in mind, you can hand it over and get a solid implementation back; hold the big picture in your head and the LLM does the rest.

A few of the changes are worth mentioning. For executing commands, we can expose a more general method for the front end to call directly:

async def execute_command(device_id, action, parameters):
    try:
        device = get_device(device_id)
        command = getattr(device, action)
        if parameters:
            command(**parameters)
        else:
            command()
    except Exception as e:
        print(f"Error executing command on device {device_id}: {e}")

Calling it from JS:

sendCommand('click', { x: imgStartX, y: imgStartY });

async function sendCommand(action, parameters) {
    const command = JSON.stringify({ type: 'action', deviceID: deviceID, action: action, parameters: parameters });
    ws.send(command);
    console.log('Command:', command);
}
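
On the Python side, a small WebSocket handler can receive these messages and dispatch them to execute_command. A minimal sketch, assuming the websockets library and a recent version of it (any ws server would do):

import asyncio
import json

import websockets  # assumed transport; any WebSocket server library works

async def handler(ws):
    async for message in ws:
        msg = json.loads(message)
        if msg.get("type") == "action":
            await execute_command(msg["deviceID"], msg["action"], msg["parameters"])

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()  # run forever

asyncio.run(main())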

Optimization

To simulate a button press, send a key event:

// SendKeyEvent presses a key (character or function keys).
func (d *Driver) SendKeyEvent(keyCode string) error {
	cmd := exec.Command("adb", "-s", d.deviceID, "shell", "input", "keyevent", keyCode)
	err := cmd.Run()
	if err != nil {
		return err
	}
	return nil
}
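
With uiautomator2 the same thing is available without shelling out to adb: the library exposes key events and text input directly (a short sketch, reusing the connected device d from above):

d.press("back")       # named keys: home, back, menu, ...
d.press(4)            # or a raw Android keycode (4 = KEYCODE_BACK)
d.send_keys("hello")  # type text into the focused input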

A quasi-real-time screen-casting option

minicap, as used above, already takes screenshots quickly: several frames per second, basically enough for this scenario. Is there a solution that is closer to real time?

The professional open-source screen-casting tool scrcpy is a good choice. Its implementation principle is actually similar to uiautomator2's: it starts a server on the device, then uses that server to obtain the audio/video stream and to send control events.
scrcpy is relatively more mature technically, while minicap, which uiautomator2 relies on, lacks maintenance and does not support newer versions of Android well.

So for screen capture you could also call scrcpy's server to achieve quasi-real-time control. But, as the title says, from getting started to giving up: the approach above already meets our needs and there is no reason to invest more effort here, so this option was abandoned.

Closing remarks

This article documents the hands-on process of building screen-casting control: starting from the adb approach, moving on to uiautomator2, and finally setting aside the scrcpy approach. It was an interesting way to pick up previously untouched knowledge on a leisurely weekend with just the right amount of free time.