My loss just won't go down below a certain range

**Hi guys.**
I'm trying to do a regression task where I first predict the values of the 5 head groups in my dataset (PIONA), and then use the inputs together with these values to predict the subsets within each of the P, I, O, N, A groups (P → CP5, CP6, CP7, CP8; I → CI5, CI6, CI7, CI8; and the same for the other groups). I have two outputs: the first gives the values of P, I, O, N, A (which sum to 1), and the second gives the subset values (CP5, CP6, …, CA7, CA8), i.e. 20 subsets whose sum should also equal 1. This is my model:

```python
import tensorflow as tf
from tensorflow import keras

inp = keras.layers.Input(shape=X_train.shape[1:])
x = keras.layers.Activation('swish')(inp)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dense(64, use_bias=False, kernel_initializer='he_normal')(x)
x = keras.layers.Activation('swish')(x)
x = keras.layers.BatchNormalization()(x)
x = keras.layers.Dense(5, use_bias=False, kernel_initializer='he_normal')(x)
# Head 1: the five PIONA fractions (softmax, so they sum to 1)
out1 = keras.layers.Dense(5, activation='softmax', name='PIONA')(x)
# Head 2 branches off the raw inputs concatenated with the 5-unit representation
y = keras.layers.concatenate([inp, x])
y = keras.layers.Activation('swish')(y)
y = keras.layers.BatchNormalization()(y)
y = keras.layers.Dense(64, use_bias=False, kernel_initializer='he_normal')(y)
y = keras.layers.Activation('swish')(y)
y = keras.layers.BatchNormalization()(y)
# Head 2: the 20 subset fractions (softmax, so they sum to 1)
out2 = keras.layers.Dense(20, activation='softmax', name='subsets')(y)
model = keras.Model(inputs=inp, outputs=[out1, out2])
```
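
Since both heads end in a softmax, the targets for each head have to sum to 1 row-wise as well. A minimal sanity check (using `y1_train` and `y2_train` as placeholder names for my two target arrays) would be something like:

```python
import numpy as np

# Both outputs are softmax, so each target row must itself sum to 1.
# y1_train: (n_samples, 5) PIONA fractions; y2_train: (n_samples, 20) subset fractions.
assert np.allclose(y1_train.sum(axis=1), 1.0, atol=1e-4)
assert np.allclose(y2_train.sum(axis=1), 1.0, atol=1e-4)
```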

I have also used CosineDecayRestarts as the learning rate schedule:

```python
lr_schedule = tf.keras.optimizers.schedules.CosineDecayRestarts(
    initial_learning_rate=1e-2,
    first_decay_steps=500,
    t_mul=2.0,
    m_mul=0.9,
    alpha=1e-6,
)
```
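
With these settings, the decay periods are 500, 1000, 2000, … steps (each one doubled by `t_mul=2.0`), so the warm restarts land at steps 500, 1500, 3500, and so on. A quick sketch to visualize where they fall:

```python
import matplotlib.pyplot as plt

# Schedule objects are callable on a step index, so we can just evaluate them.
steps = range(5000)
lrs = [float(lr_schedule(s)) for s in steps]
plt.plot(steps, lrs)
plt.xlabel('training step')
plt.ylabel('learning rate')
plt.title('CosineDecayRestarts: restarts at steps 500, 1500, 3500, ...')
plt.show()
```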
and Huber as my loss:

```python
model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=lr_schedule),
    loss=[keras.losses.Huber(), keras.losses.Huber()],  # Huber is a Loss class, so it's instantiated
    metrics=['mae', 'mae'],
)
```
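
Since the two outputs are named, the same compile step can also be written with explicit dicts keyed by the output names (an equivalent form, shown only for clarity):

```python
model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=lr_schedule),
    loss={'PIONA': keras.losses.Huber(), 'subsets': keras.losses.Huber()},
    metrics={'PIONA': ['mae'], 'subsets': ['mae']},
)
```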
My problem is that with this network, the training loss and validation loss just won't go down below a certain range. What am I doing wrong?

Also, the average gradient over the training steps looks like this:

Does anybody know what's wrong? Is this a data problem, or a problem with my network?

Sorry for sharing the pics via Google Drive; I couldn't upload them directly because I'm a new user.

Could the cosine decay learning rate schedule be causing the cycles I'm seeing (sudden val loss increases, as if overfitting, followed by gradual declines)?
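
One way I thought of checking this (just a sketch, assuming the schedule object is attached to the optimizer as above) is to log the effective learning rate each epoch and see whether the val loss spikes line up with the restarts:

```python
class LRLogger(keras.callbacks.Callback):
    # Prints the effective learning rate at the end of each epoch,
    # so the val-loss spikes can be lined up against the warm restarts.
    def on_epoch_end(self, epoch, logs=None):
        opt = self.model.optimizer
        lr = opt.learning_rate
        if isinstance(lr, keras.optimizers.schedules.LearningRateSchedule):
            lr = lr(opt.iterations)  # evaluate the schedule at the current step
        print(f'epoch {epoch}: lr = {float(lr):.2e}')

# usage: model.fit(..., callbacks=[LRLogger()])
```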