Train a model to pick the biggest number

Hi there, I’m taking my first steps in ML. I’m absolutely new to this world and just want to start understanding how it works. My goal is to train a model to play a card game, so I started by defining a good state representation and, with the help of GPT, I’ve set up a minimal working prototype. After a few trials, my impression is that the model is not learning. I’m pasting the relevant parts here, so maybe some of you can point me in the right direction.

First I create the model and define its hyperparameters with:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

def create_model():
    input_shape = (20,)
    num_actions = 3
    model = Sequential()
    model.add(Dense(128, input_shape=input_shape, activation='relu'))  # Adjust input_shape and add more layers if needed
    model.add(Dense(num_actions, activation='softmax'))  # Output layer matches the number of actions

    # Compile the model with categorical crossentropy loss, using the Adam optimizer defined above
    model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['accuracy'])

    return model

I generate a set of training states, training_actions and training_rewards.

Each state is an array of 20 integers. The action is a value from 0 to 2 (only 3 possible actions) and the reward is either 0 or 1. The system gives a reward of 1 if the model picks the position (0, 1 or 2) holding the largest of the first three integers in the state.
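
For reference, a minimal sketch of how such a sample could be generated (the layout of positions 3 to 19 is just filler to match the examples below, and the function name is only illustrative):

# Minimal sketch of a data generator (assumption: only the first three
# positions decide the reward; the remaining positions are filler).
rng = np.random.default_rng()

def generate_sample():
    state = rng.integers(-1, 16, size=20)        # 20-integer state
    best_action = int(np.argmax(state[:3]))      # index of the largest of the first 3 values
    action = int(rng.integers(0, 3))             # action actually taken (e.g. by the current policy)
    reward = 1 if action == best_action else 0   # reward 1 only for the correct pick
    return state, action, reward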

The action for each state is predicted with:

action_probabilities = model.predict(np.expand_dims(state, axis=0), verbose=0)[0]

I then normalize the probabilities and choose the action with the highest predicted value. After getting the predicted action, I calculate the reward: if the predicted action points to the greatest value among state[0], state[1] and state[2], then reward = 1, else reward = 0.
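
In code, that step looks roughly like this (a sketch of what is described above, assuming greedy selection):

# Sketch of the action-selection and reward step (assumption: greedy
# selection, i.e. always take the highest-probability action).
action_probabilities = model.predict(np.expand_dims(state, axis=0), verbose=0)[0]
action_probabilities /= action_probabilities.sum()        # normalize (softmax output already sums to ~1)
action = int(np.argmax(action_probabilities))             # action with the largest probability
reward = 1 if action == int(np.argmax(state[:3])) else 0  # reward only for picking the largest of the first 3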

For example:

Predicted Actions: [2, 2, 0]
Input States: [[4, 3, 2, -1, -1, -1, 8, -1, -1, 1, -1, -1, 5, -1, -1, 0, 1, 0, 1, 0], [1, 8, 9, -1, -1, -1, 11, -1, -1, 3, -1, -1, 13, -1, -1, 0, 1, 0, 1, 0], [8, 2, 15, -1, -1, -1, 1, -1, -1, 1, -1, -1, 2, -1, -1, 0, 1, 0, 1, 0]]
Rewards: [0, 1, 0]

As you can see, the goal is that the predicted actions had been [0, 2, 2], because the greatest of the first 3 elements of each state array is, respectively, at position 0 (value 4 in the first state), position 2 (value 9 in the second state, correctly predicted) and position 2 (value 15 in the third state).

After feeding hundreds of thousands of samples to the model, it is still unable to predict good actions.

The functions used to train the model are these:

def train_model(model, states, actions, rewards):
    # Log the data passed to train_model to a file (log() is defined elsewhere in the script)
    log(f"Actions: {actions}")
    log(f"States: {states}")
    log(f"Rewards: {rewards}")

    if states:
        # Convert training data to NumPy arrays
        X_train = np.vstack(states)

        action_indices = np.array(actions)
        rewards = np.array(rewards)

        # Calculate Q-values for the chosen actions
        q_values = calculate_q_values(len(states), action_indices, rewards)

        # Train the model
        model.fit(X_train, q_values, epochs=20, verbose=0)

def calculate_q_values(num_states, action_indices, rewards):
    # In this simplified case, we assume the rewards apply to the chosen actions
    q_values = np.zeros((num_states, len(config['commands'])))  # one column per possible action; config is defined elsewhere

    for i in range(len(action_indices)):
        action_index = action_indices[i]
        reward = rewards[i]
        q_values[i, action_index] = reward  # Update Q-value for the chosen action

    return q_values
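
For the example batch above (actions [2, 2, 0], rewards [0, 1, 0]) this produces the following targets, i.e. mostly all-zero rows:

calculate_q_values(3, np.array([2, 2, 0]), np.array([0, 1, 0]))
# -> array([[0., 0., 0.],
#           [0., 0., 1.],
#           [0., 0., 0.]])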

I really don’t know where to start optimizing the model; any help would be very much appreciated.

Many thanks in advance,
Jaume.

I am by no means an expert on reinforcement learning, but are you sure

        model.fit(X_train, q_values, epochs=20, verbose=0)

is what you actually want to do? As I understand your post, you want to use the model for simply selecting the largest number out of a vector of values, right? So you want the model to learn the max function. Then you should not use the q_values here as the target data, right?
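
If that is the case, a plain supervised setup might be closer to what you need: label each state with the index of the largest of its first three values and fit on one-hot targets. A rough sketch, reusing your create_model from above (the random data generation here is only illustrative):

# Sketch: supervised training on the correct action (learning the argmax of
# the first three positions) instead of fitting on sparse q-value rows.
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
states = rng.integers(-1, 16, size=(10000, 20)).astype(np.float32)  # illustrative random states
labels = np.argmax(states[:, :3], axis=1)                            # "right" action for each state
targets = tf.keras.utils.to_categorical(labels, num_classes=3)       # one-hot targets for categorical crossentropy

model = create_model()
model.fit(states, targets, epochs=10, batch_size=64, verbose=0)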

To improve your model, consider the following steps:

  1. Model Complexity: Increase the complexity of your model by adding more layers or units if necessary.
  2. Training Data: Ensure your training data is diverse and covers various scenarios, with a balanced distribution of rewards.
  3. Reinforcement Learning: Shift from a supervised learning approach to a reinforcement learning algorithm like Q-learning or Deep Q-Network (DQN), which are more suitable for tasks involving delayed rewards.
  4. Loss Function: In a reinforcement learning context, use loss functions that account for future rewards, such as mean squared error between predicted and target Q-values.
  5. Evaluation: Regularly evaluate your model’s performance in the game context, not just based on immediate rewards.
  6. Exploration vs. Exploitation: Implement a strategy to balance exploration of new actions and exploitation of known rewarding actions, e.g. epsilon-greedy (see the sketch after this list).
  7. State Representation: Review and possibly enhance your state representation to ensure it captures all necessary information for decision-making.
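
To make points 3, 4 and 6 concrete, here is a rough sketch of a DQN-style setup with an MSE loss and epsilon-greedy action selection; all names here are illustrative and none of it comes from the original post:

# Rough sketch of a DQN-style update with epsilon-greedy exploration.
# For this one-step task there is no next state, so the target is just the reward.
import numpy as np
import tensorflow as tf

num_actions = 3
epsilon = 0.1   # exploration rate

q_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, input_shape=(20,), activation='relu'),
    tf.keras.layers.Dense(num_actions, activation='linear'),  # raw Q-values, no softmax
])
q_model.compile(optimizer='adam', loss='mse')  # MSE between predicted and target Q-values

def select_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise exploit.
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(q_model.predict(state[None, :].astype(np.float32), verbose=0)[0]))

def train_step(states, actions, rewards):
    # Keep the current estimates and only overwrite the Q-values of the actions taken.
    targets = q_model.predict(states, verbose=0)
    targets[np.arange(len(actions)), actions] = rewards
    q_model.fit(states, targets, epochs=1, verbose=0)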

Transitioning to a reinforcement learning framework and adjusting your approach to training and evaluation should help your model learn more effectively.