Training With Probabilities

One of my network's outputs is a probability, and some of the loss components depend on realizations drawn from that probability. I have continued to pare my model down further and further, but there is still a concerning lack of variation in my results. I was originally using PyTorch and posted on their forum but got no response, so I switched to TF. In PyTorch it's really simple: just call torch.bernoulli(vector of probabilities). In TF, here's what I'm currently doing ("a" is the vector of probabilities, which is created by applying a sigmoid to that output):

dist = tfd.Bernoulli(probs=a)
samples = dist.sample()
aa = tf.cast(samples, dtype=a.dtype)
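
To make the concern concrete, here is a toy sketch in plain Python (no TensorFlow) of one thing that could be going on. Everything in it is made up for illustration: `toy_loss` is a hypothetical loss term, and the evenly spaced `draws` stand in for uniform random numbers. It compares how a loss built from hard 0/1 realizations responds to a small change in the probability with how the exact expectation over the two outcomes responds.

```python
def bernoulli_sample(p, u):
    """Realization from Bernoulli(p), given a uniform draw u in [0, 1)."""
    return 1.0 if u < p else 0.0

def toy_loss(aa):
    # Hypothetical loss term that depends on the realization aa.
    return (aa - 0.3) ** 2

# Evenly spaced stand-ins for uniform draws, so the result is reproducible.
draws = [(i + 0.5) / 1000 for i in range(1000)]

def sampled_loss(p):
    """Average loss over realizations drawn with the noise held fixed."""
    return sum(toy_loss(bernoulli_sample(p, u)) for u in draws) / len(draws)

def expected_loss(p):
    """Exact expectation over the two outcomes, weighted by p and 1 - p."""
    return p * toy_loss(1.0) + (1.0 - p) * toy_loss(0.0)

def central_diff(fn, p, eps=1e-4):
    """Central finite-difference estimate of d fn / d p."""
    return (fn(p + eps) - fn(p - eps)) / (2 * eps)

p = 0.6
grad_sampled = central_diff(sampled_loss, p)    # hard samples: flat in p
grad_expected = central_diff(expected_loss, p)  # expectation: slope in p
print(grad_sampled, grad_expected)
```

In this toy setup the sampled loss is a step function of p, so its slope is zero, while the expectation has a slope of about 0.4. If the same thing holds for the samples from tfd.Bernoulli (my understanding is that they are not reparameterized, but I have not confirmed this), the loss terms that go through "aa" would not push on the weights that produce "a". Weighting the two outcomes by a and 1 - a, or a relaxed distribution such as tfd.RelaxedBernoulli, might be alternatives worth asking about.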

I was wondering whether there are any alternative ways of drawing realizations that would make training more successful. Here's some minimal code that describes more completely what I'm doing. I stripped out everything unrelated to the network structure and the probability object (called "a").

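from typing import Tuple

import tensorflow as tf
from tqdm import tqdm

Vector = tf.Tensor  # assumed type alias (its definition is not shown here)

# construction of neural network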
layers = [
    tf.keras.layers.Dense(32, activation='relu', input_dim=1, bias_initializer='he_uniform'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(7, activation=tf.keras.activations.linear)
]
perceptron = tf.keras.Sequential(layers)

def dr(w: Vector) -> Tuple[Vector, Vector, Vector, Vector, Vector, Vector, Vector]:

    # Normalize income to be between -1 and 1
    w = (w - wmin) / (wmax - wmin) * 2.0 - 1.0
    
    s = tf.concat([_e[:,None] for _e in [w]], axis=1)
    # Perceptron output
    x = perceptron(s)  # n x 7 matrix

  ..
    a = tf.sigmoid(x[:, 2])        # sigmoid output lies in (0, 1)
...

    return .. a, ..

import tensorflow_probability as tfp
tfd = tfp.distributions
def Residuals(e: Vector, w: Vector):

    # all inputs are expected to have the same size n
    n = tf.size(w)

    # arguments correspond to the values of the states today
    X0, Π0, a, λ1, λ2, λ30, λ31 = dr(w)
    dist = tfd.Bernoulli(probs=a)
    samples = dist.sample()
    aa = tf.cast(samples, dtype=a.dtype)

    #...
    R1 = f(aa, gap_next,...)


    return (R1, ..)

def Ξ(n): # objective function for DL training

    w = tf.random.uniform(shape=(n,), minval=-.1, maxval=.1)

    e1 = tf.random.normal(shape=(n,), stddev=1)
    e2 = tf.random.normal(shape=(n,), stddev=1)
 

    R1_e1, .. = Residuals(e1, w)  # first shock draw (was e2, leaving e1 unused)
    
    R1_e2, ...= Residuals(e2,w)
    

    R_squared = R1_e1*R1_e2 +...
    # compute average across n random draws
    return tf.reduce_mean(R_squared)

n = 128
v = Ξ(n)
θ = perceptron.trainable_variables
variables = perceptron.trainable_variables
optimizer = tf.keras.optimizers.Adam()
@tf.function
def training_step():

    with tf.GradientTape() as tape:
        xx = Ξ(n)

    grads = tape.gradient(xx, θ)
    optimizer.apply_gradients(zip(grads,θ))

    return xx

def train_me(K):

    vals = []
    for k in tqdm(range(K), leave=False):
        val = training_step()
        vals.append(val.numpy())
        
    return vals
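
On the R1_e1*R1_e2 product in Ξ: assuming it is meant to estimate the squared expected residual, the two Residuals calls need independent shock draws. The plain-Python sketch below uses a hypothetical `residual` function with a known mean of 0.5 (not the model's actual residual) to show the difference between multiplying residuals from two independent draws and squaring the residual from a single draw.

```python
import random

def residual(e):
    # Hypothetical residual: true mean 0.5 plus a zero-mean shock e.
    return 0.5 + e

rng = random.Random(0)
n = 100_000
e1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
e2 = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Product across two independent draws: estimates (E[R])^2 = 0.25.
product_two_draws = sum(residual(a) * residual(b) for a, b in zip(e1, e2)) / n

# Square of a single draw: estimates E[R^2] = (E[R])^2 + Var(R) = 1.25.
square_one_draw = sum(residual(b) ** 2 for b in e2) / n

print(product_two_draws, square_one_draw)
```

With one shared draw the objective also penalizes the variance of the residual, which is not the same target as driving the expected residual to zero.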