Can anyone help me with screen control using Gemini and LangChain (optional)?

Hello developers,
I'm wondering how I can implement screen-control functionality using the Gemini API. Please give me suggestions.
Let me give you an example:
Have you heard about Jayu, the winning app in the Google Gemini API Developer Competition? It has a feature that can analyze the screen, locate the right element, and perform the intended action on it.

I tried the following steps:

  1. Take a screenshot using pyautogui.
  2. Send the screenshot image to the Gemini Pro model to get the coordinates of a target element (for example, a search box), then move the cursor to those coordinates using pyautogui.
  3. Write into or click the element using pyautogui.

But in my case, the Gemini Pro model does not return the correct coordinates of the element.
Can any experienced developer suggest how I can approach and solve this problem?

Based on what you described, plus a few ideas of my own, I put together the following workaround. I've broken it down into very basic steps so it's easy to follow for beginners:


Step 1: What You Need

You need two computers:

  1. Computer 1: Will run TensorFlow, a machine-learning library we'll use to analyze images.
  2. Computer 2: Will run a Virtual Machine (VM). Think of it as a fake computer inside your real computer.

You’ll also need:

  • Software for TensorFlow: Install it on Computer 1.
  • Software for Virtual Machines: Use programs like VMware or VirtualBox on Computer 2.

Step 2: Set Things Up

On Computer 1 (TensorFlow Machine):

  1. Install TensorFlow:
    • Download and install TensorFlow. Use this command in the terminal or command prompt:
      pip install tensorflow
      
  2. Prepare TensorFlow:
    • Download or create a program that can look at pictures (like screenshots) and find buttons or text boxes on the screen (a minimal sketch follows after this list).
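
For illustration only, here is a minimal sketch of what such a program could look like. The path ui_element_detector/ is a hypothetical SavedModel you would have trained or downloaded yourself, and the output keys assume the common TensorFlow Object Detection API convention, which your model may or may not follow:

import numpy as np
import tensorflow as tf
from PIL import Image

# Hypothetical: a SavedModel trained to detect UI elements (buttons, text boxes).
detector = tf.saved_model.load("ui_element_detector/")

# Load a screenshot and add a batch dimension.
image = np.array(Image.open("screenshot.png").convert("RGB"))
batch = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.uint8)

# Run detection; output keys follow the TF Object Detection API convention.
outputs = detector(batch)
boxes = outputs["detection_boxes"][0].numpy()   # normalized [ymin, xmin, ymax, xmax]
scores = outputs["detection_scores"][0].numpy()
print(boxes[scores > 0.5])                      # keep confident detections only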

On Computer 2 (Virtual Machine Host):

  1. Set Up a Virtual Machine (VM):

    • Install VMware or VirtualBox.
    • Create a new VM and install an operating system (like Windows).
    • Make sure the VM is running and can connect to Computer 1 (a minimal transfer sketch follows after this list).
  2. Enable Screen Recording:

    • Install a screen recorder like OBS Studio.
    • Set it up to record what happens on the VM screen.
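
As a concrete example of the connection between the two machines, here is a minimal sketch that sends a screenshot from Computer 2 to Computer 1 over HTTP. The address and the /analyze endpoint are hypothetical; you would run a matching server (for example, a small Flask app) on Computer 1:

import requests

# Hypothetical endpoint served by Computer 1.
ANALYZE_URL = "http://192.168.1.10:8000/analyze"

with open("screenshot.png", "rb") as img:
    # Upload the screenshot as a multipart file.
    response = requests.post(ANALYZE_URL, files={"file": img}, timeout=30)

print(response.status_code, response.text)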

Step 3: Teach the Computer to Find Buttons

  1. Take a Picture of the Screen:

    • Use a tool like pyautogui in Python to take a screenshot of the VM.
      import pyautogui
      screenshot = pyautogui.screenshot()
      screenshot.save("screenshot.png")
      
  2. Send the Picture to an AI Tool (Gemini):

    • Use the Gemini API to find where the buttons or text boxes are in the picture.
    • Example code (a sketch using the google-generativeai SDK; the prompt wording is illustrative):
      import google.generativeai as genai
      import PIL.Image
      genai.configure(api_key="YOUR_API_KEY")
      model = genai.GenerativeModel("gemini-1.5-pro")
      img = PIL.Image.open("screenshot.png")
      prompt = "Return a bounding box for the search box in [ymin, xmin, ymax, xmax] format."
      response = model.generate_content([img, prompt])
      print(response.text)
  3. Check if the AI is Correct:

    • The model replies with numbers (bounding-box coordinates) showing where the button is on the screen; a small parsing sketch follows after this list.
    • If it’s wrong, try a clearer screenshot or give the model better examples in the prompt.
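
Since the model replies in free text, you need to pull the numbers out of it yourself. Here is a minimal parsing sketch; parse_box is a hypothetical helper that assumes the reply contains exactly one [ymin, xmin, ymax, xmax] list of integers:

import json
import re

def parse_box(text):
    # Grab the first [a, b, c, d] integer list in the model's reply.
    match = re.search(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]", text)
    return json.loads(match.group(0)) if match else None

box = parse_box(response.text)  # e.g. [120, 40, 180, 600]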

Step 4: Make the Computer Click Buttons

  1. Move the Mouse to the Button:

    • Use the coordinates from the AI to move the mouse (here, to the center of the parsed box).
      import pyautogui
      ymin, xmin, ymax, xmax = box  # parsed in Step 3; assumes screen-pixel values
      pyautogui.moveTo((xmin + xmax) / 2, (ymin + ymax) / 2)
      pyautogui.click()
  2. Type Text or Click More Buttons:

    • After clicking, you can tell the computer to type or interact with the element.
      pyautogui.write("Hello World!")
      

Step 5: Improve Accuracy

  1. Clearer Pictures Help:

    • Make sure your screenshots are clear and don’t have too much extra stuff, like toolbars or pop-ups.
  2. Teach the AI Better:

    • If the AI keeps getting it wrong, give it better examples in the prompt (few-shot prompting): include a short worked example showing where a button is, so the model can follow the pattern. A sketch follows after this list.
  3. Test Everything:

    • Try it on simple apps first to see if it works. Fix mistakes as you go.
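
To illustrate "giving better examples", here is a minimal few-shot prompting sketch. The example coordinates are made up, and model and img are the objects from Step 3:

# Few-shot prompt: show the model one worked example before the real request.
prompt = (
    "You locate UI elements in screenshots and answer with a single "
    "[ymin, xmin, ymax, xmax] list scaled to 0-1000.\n"
    "Example: for a screenshot with a search box along the top edge, "
    "a correct answer is [10, 250, 60, 750].\n"
    "Now return the bounding box of the search box in this screenshot."
)
response = model.generate_content([img, prompt])
print(response.text)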

Step 6: Put It All Together

  1. Connect the Two Computers:

    • Use Computer 2 to run the VM and record the screen.
    • Send the screenshots to Computer 1 for analysis.
  2. Automate Everything:

    • Write a Python script that takes screenshots, sends them to the AI, and clicks the buttons automatically (a sketch follows below).
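
Putting it all together, a minimal single-machine sketch could look like this. parse_box is the hypothetical helper from Step 3, and the coordinates are assumed to already be screen pixels (see the reply below about scaling):

import google.generativeai as genai
import PIL.Image
import pyautogui

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

def find_and_click(element, text_to_type=None):
    # 1. Screenshot the current screen.
    pyautogui.screenshot().save("screenshot.png")
    # 2. Ask Gemini where the element is.
    prompt = f"Return a bounding box for the {element} in [ymin, xmin, ymax, xmax] format."
    response = model.generate_content([PIL.Image.open("screenshot.png"), prompt])
    box = parse_box(response.text)  # hypothetical helper from Step 3
    if box is None:
        raise RuntimeError("No coordinates in reply: " + response.text)
    ymin, xmin, ymax, xmax = box
    # 3. Click the center of the box, then optionally type.
    pyautogui.moveTo((xmin + xmax) / 2, (ymin + ymax) / 2)
    pyautogui.click()
    if text_to_type:
        pyautogui.write(text_to_type)

find_and_click("search box", text_to_type="Hello World!")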

Final Tips

  • Keep Testing: Start small and fix problems step by step.
  • Ask for Help: If something doesn’t work, ask someone who knows Python or AI.
  • Be Patient: It might take time to get everything working smoothly.

If you need help with a specific part, let me know!

Thank you so much for this detailed step-by-step guide. I also want to add something:

From the official documentation:

Get a bounding box for an object

Gemini models are trained to return bounding box coordinates as relative widths or heights in the range of [0, 1]. These values are then scaled by 1000 and converted to integers. Effectively, the coordinates represent the bounding box on a 1000x1000 pixel version of the image. Therefore, you’ll need to convert these coordinates back to the dimensions of your original image to accurately map the bounding boxes.

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-pro")

prompt = "Return a bounding box for each of the objects in this image in [ymin, xmin, ymax, xmax] format."
response = model.generate_content([sample_file_1, prompt])

print(response.text)

The model returns bounding box coordinates in the format [ymin, xmin, ymax, xmax]. To convert these normalized coordinates to the pixel coordinates of your original image, follow these steps:

  1. Divide each output coordinate by 1000.
  2. Multiply the x-coordinates by the original image width.
  3. Multiply the y-coordinates by the original image height.

This explains why I was getting the incorrect coordinates.
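
In code, the conversion back to the original image's pixel coordinates looks roughly like this (a sketch; the example box and screen size are made up):

# Convert a [ymin, xmin, ymax, xmax] box from Gemini's 0-1000 space
# to pixel coordinates of the original screenshot.
def to_pixels(box, image_width, image_height):
    ymin, xmin, ymax, xmax = (v / 1000 for v in box)
    return (ymin * image_height, xmin * image_width,
            ymax * image_height, xmax * image_width)

# Example: a made-up model output on a 1920x1080 screenshot.
ymin, xmin, ymax, xmax = to_pixels([120, 40, 180, 600], 1920, 1080)
center = ((xmin + xmax) / 2, (ymin + ymax) / 2)  # point to click with pyautogui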

However, your idea of using a virtual machine in parallel is a good one. I will try to implement it.
Thanks!