How to build a picture question-answering assistant yourself

There is a picture shown below:

There is a feature on Kimi that parses the content of a picture to give an answer:

This is handy when you want to take a picture and ask the AI a question about it. I needed this myself, so I got my hands dirty.

The do-it-yourself implementation is shown below:

So how do you achieve this on your own?

It can be done by adding an OCR step. For Chinese image text recognition (that is, OCR), Baidu's open-source PaddleOCR gives good results. I previously introduced PaddleSharp, the .NET binding for PaddleOCR; see this article: "C# using PaddleOCR for image text recognition".

When I used PaddleOCR before, I had already set up a virtual environment for it on my computer. The requirement here is simple: recognize the text in a picture and return it. So today I will try something different: instead of using the .NET binding, I will call a Python script directly.

So the task now breaks down to:

How does C# call Python scripts?

So let's try the simplest call first, calling a Python script to output a Hello:

print("Hello")

You can use the Process class (in System.Diagnostics) to start an external process that runs the Python script.

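 // Requires: using System.Diagnostics; using System.IO;
 string pythonScriptPath = @"D:\Learning Routes\artificial intelligence (AI)\Image Text Recognition\"; // Replace with the full path to your Python script
 string pythonExecutablePath = @"D:\SoftWare\Anaconda\envs\paddle_env\"; // Replace with the full path to your Python interpreter
 ProcessStartInfo start = new ProcessStartInfo();
 start.FileName = pythonExecutablePath;
 start.Arguments = $"{pythonScriptPath}"; // If the path contains spaces, wrap it in quotes
 start.UseShellExecute = false;
 start.RedirectStandardOutput = true;
 start.RedirectStandardError = true;
 start.CreateNoWindow = true;

 using (Process process = Process.Start(start))
 {
     // Read everything the script wrote to standard output
     using (StreamReader reader = process.StandardOutput)
     {
         string result = reader.ReadToEnd();
         Console.WriteLine(result);
     }

     // Read anything the script wrote to standard error
     using (StreamReader errorReader = process.StandardError)
     {
         string errors = errorReader.ReadToEnd();
         if (!string.IsNullOrEmpty(errors))
         {
             Console.WriteLine("Errors: " + errors);
         }
     }
 }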

The ProcessStartInfo properties used above are explained below:

FileName:
- Meaning: Specifies the name of the program or document to start.
- Example: pythonExecutablePath is the path to the Python interpreter, such as "C:\path\to\".
Arguments:
- Meaning: Specifies the command-line arguments passed to the program being started.
- Example: pythonScriptPath is the path to the Python script you want to execute, such as "C:\path\to\".
UseShellExecute:
- Meaning: Specifies whether to use the operating system shell to start the process. If set to false, the process is started directly; if set to true, the process is started via the shell.
- Example: Here it is set to false, so the Python interpreter is started directly rather than through the shell. Redirecting the standard streams requires this setting.
RedirectStandardOutput:
- Meaning: Specifies whether to redirect the standard output of the child process to the Process.StandardOutput stream.
- Example: Here it is set to true, so the output of the Python script is redirected to Process.StandardOutput, where you can read it.
RedirectStandardError:
- Meaning: Specifies whether to redirect the standard error output of the child process to the Process.StandardError stream.
- Example: Here it is set to true, so the error output of the Python script is redirected to Process.StandardError, where you can read it.
CreateNoWindow:
- Meaning: Specifies whether to start the process without creating a new window. If set to true, no new window is created; if set to false, a new window is created.
- Example: Here it is set to true, so no new window is created, i.e., the Python script runs in the background.
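
For readers who want to experiment with these settings without a C# project, the same idea can be sketched in Python with the standard subprocess module. This is only an illustration, not the article's code: the inline snippet passed with -c stands in for the Hello script file, and run_script is a hypothetical helper name.

```python
import subprocess
import sys
from typing import Tuple


def run_script(code: str) -> Tuple[str, str, int]:
    """Run a snippet of Python in a child process and capture its output.

    Hypothetical helper for illustration; `code` stands in for a script file.
    """
    completed = subprocess.run(
        [sys.executable, "-c", code],  # interpreter + arguments (FileName + Arguments)
        shell=False,                   # start the process directly (UseShellExecute = false)
        capture_output=True,           # capture stdout and stderr (the two Redirect* flags)
        text=True,                     # decode the captured bytes to str
    )
    return completed.stdout, completed.stderr, completed.returncode


if __name__ == "__main__":
    out, err, exit_code = run_script('print("Hello")')
    print(out, end="")
    if err:
        print("Errors: " + err)
```

Here subprocess.run collects both streams for you, so the caller does not have to read them one after the other.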

Now look at the result of the run:

The C# program received the value printed by the Python script.

Breaking the task down a bit further, we need to pass a parameter on the command line. How do we do that?

import sys

# Check if an argument is passed
if len(sys.argv) > 1:
    n = sys.argv[1]
    print(f"hello {n}")
else:
    print("Please provide a parameter")

In the C# code, only two places need to change, as the picture below shows:

Now run it again:

Successfully passed a parameter on the command line.
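
Since the parameter will later be an image path, it is worth checking that an argument containing spaces still arrives as a single sys.argv entry. The sketch below is an illustration rather than the article's code: GREET is an inline stand-in for the greeting script, and greet is a hypothetical helper. Passing each argument as its own list element keeps it intact. On the C# side, where Arguments is one string, a path with spaces would need surrounding quotes to get the same effect.

```python
import subprocess
import sys

# Inline stand-in for the greeting script shown above.
GREET = (
    "import sys\n"
    "if len(sys.argv) > 1:\n"
    "    print(f'hello {sys.argv[1]}')\n"
    "else:\n"
    "    print('Please provide a parameter')\n"
)


def greet(*args: str) -> str:
    """Run the greeting script with the given arguments and return its stdout."""
    completed = subprocess.run(
        [sys.executable, "-c", GREET, *args],  # each argument is a separate list element
        capture_output=True,
        text=True,
        check=True,  # raise if the child process exits with an error
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    print(greet("world"))
    print(greet("my picture.png"))  # the space does not split the argument
```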

So now our preparations are done.

The script for using PaddleOCR is as follows:

import sys
import logging
from paddleocr import PaddleOCR, draw_ocr

# PaddleOCR currently supports multiple languages, which can be switched by modifying the lang parameter.
# For example, `ch`, `en`, `fr`, `german`, `korean`, `japan`.


# Check if a parameter is passed
if len(sys.argv) > 1:
    imagePath = sys.argv[1]
else:
    print("Please provide a parameter")
    sys.exit(1)  # stop here; without an image path there is nothing to recognize

# Configure the logging level to WARNING so that DEBUG and INFO level log messages are hidden.
logging.basicConfig(level=logging.WARNING)

# Create a custom log handler that discards every record (no output)
class NullHandler(logging.Handler):
    def emit(self, record):
        pass

# Get the logger for PaddleOCR
ppocr_logger = logging.getLogger('ppocr')

# Remove all default log handlers
for handler in ppocr_logger.handlers[:]:
    ppocr_logger.removeHandler(handler)

# Add a custom NullHandler
ppocr_logger.addHandler(NullHandler())

ocr = PaddleOCR(use_angle_cls=True, lang="ch") # need to run only once to download and load model into memory
img_path = imagePath
result = ocr.ocr(img_path, cls=True)
# Print only the recognized text of each line, so the caller gets plain text on stdout
for idx in range(len(result)):
    res = result[idx]
    for line in res:
        print(line[1][0])
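
The nested indexing in the final loop (line[1][0]) is easier to follow with the shape of the result written out. The sketch below assumes the layout that the script relies on: one list per image, where each entry is [box, (text, confidence)]. The sample data is made up, extract_text is a hypothetical helper, and the None check is a defensive assumption for images in which no text is detected.

```python
from typing import List, Optional, Sequence


def extract_text(result: Sequence[Optional[Sequence]]) -> List[str]:
    """Collect the recognized text of every line, skipping empty pages."""
    lines = []
    for page in result:
        if page is None:  # assumed shape when nothing was recognized
            continue
        for box, (text, confidence) in page:
            lines.append(text)  # same value as line[1][0] in the script
    return lines


# Made-up sample in the assumed layout: corner points, then (text, confidence).
sample = [
    [
        [[[10, 10], [120, 10], [120, 40], [10, 40]], ("What is 1 + 1?", 0.98)],
        [[[10, 50], [90, 50], [90, 80], [10, 80]], ("A. 2", 0.95)],
    ]
]

if __name__ == "__main__":
    print("\n".join(extract_text(sample)))
```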

The result of running it in VS Code is shown below:

Now the result of the call in the WPF application is as follows:

Now the image text recognition part is taken care of.

Now it's time to combine it with a large language model, that is, to hand the recognized text to the large language model.

It can be written like this:

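 public async IAsyncEnumerable<string> GetAIResponse4(string question, string imagePath)
 {
     string pythonScriptPath = @"D:\Learning Routes\artificial intelligence (AI)\Image Text Recognition\"; // Replace with the full path to your Python script
     string pythonExecutablePath = @"D:\SoftWare\Anaconda\envs\paddle_env\"; // Replace with the full path to your Python interpreter
     string arguments = imagePath; // The parameter passed to the script: the image to recognize

     ProcessStartInfo start = new ProcessStartInfo();
     start.FileName = pythonExecutablePath;
     start.Arguments = $"{pythonScriptPath} {arguments}"; // If either path contains spaces, wrap it in quotes
     start.UseShellExecute = false;
     start.RedirectStandardOutput = true;
     start.RedirectStandardError = true;
     start.CreateNoWindow = true;

     string result = "";

     using (Process process = Process.Start(start))
     {
         // The recognized text printed by the Python script
         using (StreamReader reader = process.StandardOutput)
         {
             result = reader.ReadToEnd();
         }

         using (StreamReader errorReader = process.StandardError)
         {
             string errors = errorReader.ReadToEnd();
             if (!string.IsNullOrEmpty(errors))
             {
                 Console.WriteLine("Errors: " + errors);
             }
         }
     }

     // Prompt template: the recognized text and the user's question are filled in as variables
     string skPrompt = """
                        Get the content of the image: {{$PictureContent}}.
                        Answer the questions based on the information obtained: {{$Question}}.
                     """;
     await foreach (var str in _kernel.InvokePromptStreamingAsync(skPrompt, new() { ["PictureContent"] = result, ["Question"] = question }))
     {
         yield return str.ToString();
     }
 }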
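
To make the prompt step concrete, here is a small Python sketch of what filling the {{$PictureContent}} and {{$Question}} placeholders amounts to. It only illustrates the substitution idea with a regular expression. It is not how Semantic Kernel implements templating, and render_prompt and the sample values are hypothetical.

```python
import re

PROMPT = (
    "Get the content of the image: {{$PictureContent}}.\n"
    "Answer the questions based on the information obtained: {{$Question}}.\n"
)


def render_prompt(template: str, variables: dict) -> str:
    """Replace each {{$name}} placeholder with the matching value.

    Unknown placeholders are left untouched so that mistakes stay visible.
    """
    def substitute(match):
        name = match.group(1)
        return str(variables.get(name, match.group(0)))

    return re.sub(r"\{\{\$(\w+)\}\}", substitute, template)


if __name__ == "__main__":
    # Made-up values standing in for the OCR output and the user's question.
    print(render_prompt(PROMPT, {
        "PictureContent": "1 + 1 = ?",
        "Question": "What is the answer?",
    }))
```

The point of the template is that the model sees the OCR text and the question together in one prompt, so it can answer from the picture's content.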

With that in place, you get the following result:

The full code can be seen at /Ming-jiayou/SimpleRAG.