Can anyone help me with screen control using Gemini and LangChain (optional)?

Hello developers,
I'm wondering how I can implement screen-control functionality using the Gemini API. Please give me suggestions.
Let me give you an example:
Have you heard about Jayu, the winning app in the Google Gemini API Developer Competition? It has a feature that can analyze the screen, locate the right element, and perform the intended action on it.

I tried the following steps:

  1. Take a screenshot using pyautogui.
  2. Send the screenshot image to the Gemini Pro model to get the coordinates of a target element (for example, a search box), then move the cursor to those coordinates using pyautogui.
  3. Write into or click the element using pyautogui.

But in my case, the Gemini Pro model does not return the correct coordinates of the element.
Can any experienced developer suggest how I can approach and solve this problem?

Based on what you described, plus a few ideas of my own, I put together the following workaround. I've broken it down into very basic steps so it's easy to follow for beginners:


Step 1: What You Need

You need two computers:

  1. Computer 1: Will run TensorFlow, a machine-learning library we'll use to analyze images.
  2. Computer 2: Will run a Virtual Machine (VM). Think of it as a fake computer inside your real computer.

You’ll also need:

  • Software for TensorFlow: Install it on Computer 1.
  • Software for Virtual Machines: Use programs like VMware or VirtualBox on Computer 2.

Step 2: Set Things Up

On Computer 1 (TensorFlow Machine):

  1. Install TensorFlow:
    • Download and install TensorFlow. Use this command in the terminal or command prompt:
      pip install tensorflow
      
  2. Prepare TensorFlow:
    • Download or create a program that can look at pictures (like screenshots) and find buttons or text boxes on the screen (a minimal sketch follows after this list).
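
For illustration only, here is a minimal sketch of what such a program could look like. The path ui_element_detector/ is a hypothetical SavedModel you would have trained or downloaded yourself, and the output keys assume the common TensorFlow Object Detection API convention, which your model may or may not follow:

import numpy as np
import tensorflow as tf
from PIL import Image

# Hypothetical: a SavedModel trained to detect UI elements (buttons, text boxes).
detector = tf.saved_model.load("ui_element_detector/")

# Load a screenshot and add a batch dimension.
image = np.array(Image.open("screenshot.png").convert("RGB"))
batch = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.uint8)

# Run detection; output keys follow the TF Object Detection API convention.
outputs = detector(batch)
boxes = outputs["detection_boxes"][0].numpy()   # normalized [ymin, xmin, ymax, xmax]
scores = outputs["detection_scores"][0].numpy()
print(boxes[scores > 0.5])                      # keep confident detections only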

On Computer 2 (Virtual Machine Host):

  1. Set Up a Virtual Machine (VM):

    • Install VMware or VirtualBox.
    • Create a new VM and install an operating system (like Windows).
    • Make sure the VM is running and can connect to Computer 1 (a minimal transfer sketch follows after this list).
  2. Enable Screen Recording:

    • Install a screen recorder like OBS Studio.
    • Set it up to record what happens on the VM screen.
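
As a concrete example of the connection between the two machines, here is a minimal sketch that sends a screenshot from Computer 2 to Computer 1 over HTTP. The address and the /analyze endpoint are hypothetical; you would run a matching server (for example, a small Flask app) on Computer 1:

import requests

# Hypothetical endpoint served by Computer 1.
ANALYZE_URL = "http://192.168.1.10:8000/analyze"

with open("screenshot.png", "rb") as img:
    # Upload the screenshot as a multipart file.
    response = requests.post(ANALYZE_URL, files={"file": img}, timeout=30)

print(response.status_code, response.text)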

Step 3: Teach the Computer to Find Buttons

  1. Take a Picture of the Screen:

    • Use a tool like pyautogui in Python to take a screenshot of the VM.
      import pyautogui
      screenshot = pyautogui.screenshot()
      screenshot.save("screenshot.png")
      
  2. Send the Picture to an AI Tool (Gemini):

    • Use the Gemini API to find where the buttons or text boxes are in the picture.
    • Example code (a sketch using the google-generativeai SDK; the prompt wording is illustrative):
      import google.generativeai as genai
      import PIL.Image
      genai.configure(api_key="YOUR_API_KEY")
      model = genai.GenerativeModel("gemini-1.5-pro")
      img = PIL.Image.open("screenshot.png")
      prompt = "Return a bounding box for the search box in [ymin, xmin, ymax, xmax] format."
      response = model.generate_content([img, prompt])
      print(response.text)
  3. Check if the AI is Correct:

    • The model replies with numbers (bounding-box coordinates) showing where the button is on the screen; a small parsing sketch follows after this list.
    • If it’s wrong, try a clearer screenshot or give the model better examples in the prompt.
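
Since the model replies in free text, you need to pull the numbers out of it yourself. Here is a minimal parsing sketch; parse_box is a hypothetical helper that assumes the reply contains exactly one [ymin, xmin, ymax, xmax] list of integers:

import json
import re

def parse_box(text):
    # Grab the first [a, b, c, d] integer list in the model's reply.
    match = re.search(r"\[\s*\d+\s*,\s*\d+\s*,\s*\d+\s*,\s*\d+\s*\]", text)
    return json.loads(match.group(0)) if match else None

box = parse_box(response.text)  # e.g. [120, 40, 180, 600]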

Step 4: Make the Computer Click Buttons

  1. Move the Mouse to the Button:

    • Use the coordinates from the AI to move the mouse (here, to the center of the parsed box).
      import pyautogui
      ymin, xmin, ymax, xmax = box  # parsed in Step 3; assumes screen-pixel values
      pyautogui.moveTo((xmin + xmax) / 2, (ymin + ymax) / 2)
      pyautogui.click()
  2. Type Text or Click More Buttons:

    • After clicking, you can tell the computer to type or interact with the element.
      pyautogui.write("Hello World!")
      

Step 5: Improve Accuracy

  1. Clearer Pictures Help:

    • Make sure your screenshots are clear and don’t have too much extra stuff, like toolbars or pop-ups.
  2. Teach the AI Better:

    • If the AI keeps getting it wrong, give it better examples in the prompt (few-shot prompting): include a short worked example showing where a button is, so the model can follow the pattern. A sketch follows after this list.
  3. Test Everything:

    • Try it on simple apps first to see if it works. Fix mistakes as you go.
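
To illustrate "giving better examples", here is a minimal few-shot prompting sketch. The example coordinates are made up, and model and img are the objects from Step 3:

# Few-shot prompt: show the model one worked example before the real request.
prompt = (
    "You locate UI elements in screenshots and answer with a single "
    "[ymin, xmin, ymax, xmax] list scaled to 0-1000.\n"
    "Example: for a screenshot with a search box along the top edge, "
    "a correct answer is [10, 250, 60, 750].\n"
    "Now return the bounding box of the search box in this screenshot."
)
response = model.generate_content([img, prompt])
print(response.text)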

Step 6: Put It All Together

  1. Connect the Two Computers:

    • Use Computer 2 to run the VM and record the screen.
    • Send the screenshots to Computer 1 for analysis.
  2. Automate Everything:

    • Write a Python script that takes screenshots, sends them to the AI, and clicks the buttons automatically (a sketch follows below).
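
Putting it all together, a minimal single-machine sketch could look like this. parse_box is the hypothetical helper from Step 3, and the coordinates are assumed to already be screen pixels (see the reply below about scaling):

import google.generativeai as genai
import PIL.Image
import pyautogui

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

def find_and_click(element, text_to_type=None):
    # 1. Screenshot the current screen.
    pyautogui.screenshot().save("screenshot.png")
    # 2. Ask Gemini where the element is.
    prompt = f"Return a bounding box for the {element} in [ymin, xmin, ymax, xmax] format."
    response = model.generate_content([PIL.Image.open("screenshot.png"), prompt])
    box = parse_box(response.text)  # hypothetical helper from Step 3
    if box is None:
        raise RuntimeError("No coordinates in reply: " + response.text)
    ymin, xmin, ymax, xmax = box
    # 3. Click the center of the box, then optionally type.
    pyautogui.moveTo((xmin + xmax) / 2, (ymin + ymax) / 2)
    pyautogui.click()
    if text_to_type:
        pyautogui.write(text_to_type)

find_and_click("search box", text_to_type="Hello World!")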

Final Tips

  • Keep Testing: Start small and fix problems step by step.
  • Ask for Help: If something doesn’t work, ask someone who knows Python or AI.
  • Be Patient: It might take time to get everything working smoothly.

If you need help with a specific part, let me know!

Thank you so much for this detailed step-by-step guide. I also want to add something:

From the official documentation:

Get a bounding box for an object

Gemini models are trained to return bounding box coordinates as relative widths or heights in the range of [0, 1]. These values are then scaled by 1000 and converted to integers. Effectively, the coordinates represent the bounding box on a 1000x1000 pixel version of the image. Therefore, you’ll need to convert these coordinates back to the dimensions of your original image to accurately map the bounding boxes.

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-pro")

prompt = "Return a bounding box for each of the objects in this image in [ymin, xmin, ymax, xmax] format."
response = model.generate_content([sample_file_1, prompt])

print(response.text)

The model returns bounding box coordinates in the format [ymin, xmin, ymax, xmax]. To convert these normalized coordinates to the pixel coordinates of your original image, follow these steps:

  1. Divide each output coordinate by 1000.
  2. Multiply the x-coordinates by the original image width.
  3. Multiply the y-coordinates by the original image height.

This explains why I was getting the incorrect coordinates.
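
In code, the conversion back to the original image's pixel coordinates looks roughly like this (a sketch; the example box and screen size are made up):

# Convert a [ymin, xmin, ymax, xmax] box from Gemini's 0-1000 space
# to pixel coordinates of the original screenshot.
def to_pixels(box, image_width, image_height):
    ymin, xmin, ymax, xmax = (v / 1000 for v in box)
    return (ymin * image_height, xmin * image_width,
            ymax * image_height, xmax * image_width)

# Example: a made-up model output on a 1920x1080 screenshot.
ymin, xmin, ymax, xmax = to_pixels([120, 40, 180, 600], 1920, 1080)
center = ((xmin + xmax) / 2, (ymin + ymax) / 2)  # point to click with pyautogui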

However, your idea of using a virtual machine in parallel is a good one. I will try to implement it.
Thanks!