Hello developers,
I'm wondering how I can control screen functionality using the Gemini API. Please give me suggestions.
Let me give you an example:
Have you heard about Jayu, the winning app in Google's Gemini API Developer Competition? It has a feature that can analyse the screen, locate the right element, and perform the intended action on it.
I tried the following steps:
- Take a screenshot using pyautogui,
- send the screenshot image to the Gemini Pro model to get the coordinates of an element (for example, the search box), then move the cursor there using pyautogui,
- then click or type into the element using pyautogui.
But in my case, the Gemini Pro model does not give the correct coordinates of the element.
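Roughly, my script looks like this (simplified; the prompt and the parsing are just what I tried, not anything official):

import pyautogui
import google.generativeai as genai

genai.configure(api_key="MY_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Step 1: take a screenshot (pyautogui returns a PIL image).
screenshot = pyautogui.screenshot()

# Step 2: ask Gemini for the element's position in pixels.
response = model.generate_content([
    screenshot,
    "Give the x,y pixel coordinates of the search box as two integers separated by a comma.",
])
x, y = [int(v) for v in response.text.split(",")]

# Step 3: move and click. The coordinates rarely match the real element.
pyautogui.moveTo(x, y)
pyautogui.click()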
Can any experienced developer suggest how I can approach and solve this problem?
Based on what you described, plus a few creative ideas of my own, I put together the following workaround. Let me break it down into very basic steps; here’s an easy guide for beginners:
Step 1: What You Need
You need two computers:
- Computer 1: Will run TensorFlow, a program that helps analyze images.
- Computer 2: Will run a Virtual Machine (VM). Think of it as a fake computer inside your real computer.
You’ll also need:
- Software for TensorFlow: Install it on Computer 1.
- Software for Virtual Machines: Use programs like VMware or VirtualBox on Computer 2.
Step 2: Set Things Up
On Computer 1 (TensorFlow Machine):
- Install TensorFlow:
  - Run this command in the terminal or command prompt:
    pip install tensorflow
- Prepare TensorFlow:
  - Download or create a program that can look at pictures (like screenshots) and find buttons or text boxes on the screen.
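For example, here is a quick sanity check that TensorFlow is installed and can read a screenshot (the file name is just an example):

import tensorflow as tf

print(tf.__version__)

# Load a screenshot and decode it into a tensor of pixels.
data = tf.io.read_file("screenshot.png")
image = tf.io.decode_png(data, channels=3)
print(image.shape)  # e.g. (1080, 1920, 3) for a Full HD screen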
On Computer 2 (Virtual Machine Host):
- Set Up a Virtual Machine (VM):
  - Install VMware or VirtualBox.
  - Create a new VM and install an operating system (like Windows).
  - Make sure the VM is running and can connect to Computer 1.
- Enable Screen Recording:
  - Install a screen recorder like OBS Studio.
  - Set it up to record what happens on the VM screen.
Step 3: Teach the Computer to Find Buttons
- Take a Picture of the Screen:
  - Capture a screenshot of the VM screen, for example with pyautogui.
- Send the Picture to an AI Tool (Gemini):
  - Ask Gemini to find the element you care about, like a search box (a sketch follows this list).
- Check if the AI is Correct:
  - The AI will give you numbers (X and Y coordinates) showing where the button is on the screen.
  - If it’s wrong, try a clearer picture or retrain the AI with better examples.
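Here is a minimal sketch of this step, assuming the google-generativeai package is installed and you have an API key; the prompt wording and the JSON reply format are my own guesses, not a fixed API:

import json
import pyautogui
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

# Take a picture of the screen.
screenshot = pyautogui.screenshot()

# Send the picture to Gemini and ask where the element is.
response = model.generate_content([
    screenshot,
    'Locate the search box. Reply with JSON only: {"x": <int>, "y": <int>}',
])

# Check if the AI is correct before automating anything.
# (You may need to strip Markdown code fences from response.text first.)
coords = json.loads(response.text)
print(coords)  # compare these numbers with where the element really is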
Step 4: Make the Computer Click Buttons
- Move the Mouse to the Button:
  - Use pyautogui to move the cursor to the coordinates the AI returned.
- Type Text or Click More Buttons:
  - Use pyautogui to click the element or type text into it, as in the sketch below.
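Continuing the sketch above, once the coordinates look right:

import pyautogui

coords = {"x": 512, "y": 384}  # example output from Step 3

# Move the mouse to the button, then act on it.
pyautogui.moveTo(coords["x"], coords["y"], duration=0.5)
pyautogui.click()
pyautogui.write("hello world", interval=0.05)  # type into the focused field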
Step 5: Improve Accuracy
- Clearer Pictures Help:
  - Make sure your screenshots are clear and don’t have too much extra stuff, like toolbars or pop-ups.
- Teach the AI Better:
  - If the AI keeps getting it wrong, give it better examples. You can use pictures of screens with buttons and show it where the buttons are.
- Test Everything:
  - Try it on simple apps first to see if it works. Fix mistakes as you go.
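One way to get clearer pictures is to capture only the region you care about instead of the whole desktop; pyautogui supports this directly (the region numbers here are made up):

import pyautogui

# Capture just the app window: (left, top, width, height).
region = (100, 100, 800, 600)  # hypothetical window position
cropped = pyautogui.screenshot(region=region)
cropped.save("app_window.png")

# Note: coordinates the model returns are now relative to the crop,
# so add the offsets back before clicking:
# screen_x = model_x + region[0]
# screen_y = model_y + region[1]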
Step 6: Put It All Together
- Connect the Two Computers:
  - Use Computer 2 to run the VM and record the screen.
  - Send the screenshots to Computer 1 for analysis.
- Automate Everything:
  - Write a Python script that takes screenshots, sends them to the AI, and clicks the buttons automatically.
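For the two-computer setup, one simple option is a small HTTP hop: Computer 2 posts each screenshot to Computer 1 and gets coordinates back. This is only a sketch; the /analyze endpoint and its reply format are hypothetical, something you would implement yourself on Computer 1:

import pyautogui
import requests

# Computer 2 side: capture the VM screen and send it for analysis.
pyautogui.screenshot().save("frame.png")

with open("frame.png", "rb") as f:
    # Hypothetical endpoint served by your own script on Computer 1.
    resp = requests.post("http://computer1.local:5000/analyze", files={"image": f})

data = resp.json()  # assumed reply format: {"x": ..., "y": ...}
pyautogui.moveTo(data["x"], data["y"])
pyautogui.click()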
Final Tips
- Keep Testing: Start small and fix problems step by step.
- Ask for Help: If something doesn’t work, ask someone who knows Python or AI.
- Be Patient: It might take time to get everything working smoothly.
If you need help with a specific part, let me know!
Thank you so much for this step-by-step guide. I also want to add something:
<-----
Get a bounding box for an object
Gemini models are trained to return bounding box coordinates as relative widths or heights in the range of [0, 1]. These values are then scaled by 1000 and converted to integers. Effectively, the coordinates represent the bounding box on a 1000x1000 pixel version of the image. Therefore, you’ll need to convert these coordinates back to the dimensions of your original image to accurately map the bounding boxes.
import google.generativeai as genai

# Assumes genai.configure(api_key=...) has already been called.
# sample_file_1 is the screenshot, uploaded via the File API.
sample_file_1 = genai.upload_file("screenshot.png")

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
prompt = "Return a bounding box for each of the objects in this image in [ymin, xmin, ymax, xmax] format."
response = model.generate_content([sample_file_1, prompt])
print(response.text)
The model returns bounding box coordinates in the format [ymin, xmin, ymax, xmax]. To convert these normalized coordinates to the pixel coordinates of your original image, follow these steps:
- Divide each output coordinate by 1000.
- Multiply the x-coordinates by the original image width.
- Multiply the y-coordinates by the original image height.
---->>>
I found this in the official documentation, and it explains why I was getting incorrect coordinates: the model returns coordinates normalized to a 1000x1000 grid, not the pixel coordinates of my screenshot.
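To make that concrete, here is a small conversion helper I sketched from those steps (the function name and the example numbers are my own):

def to_pixel_box(box, image_width, image_height):
    # box is [ymin, xmin, ymax, xmax], normalized to a 1000x1000 grid.
    ymin, xmin, ymax, xmax = box
    return (
        int(ymin / 1000 * image_height),
        int(xmin / 1000 * image_width),
        int(ymax / 1000 * image_height),
        int(xmax / 1000 * image_width),
    )

# Example: a model box on a 1920x1080 screenshot.
print(to_pixel_box([400, 100, 600, 900], 1920, 1080))
# -> (432, 192, 648, 1728)

The centre of that box in pixels, ((xmin + xmax) / 2, (ymin + ymax) / 2), is what I will pass to pyautogui.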
Your idea of running a virtual machine in parallel is a good one, though; I will try to implement it.
Thanks