GradientBoostedTree not producing valid prediction values

I am brand new to TensorFlow and machine learning in general, but I have been able to do a lot more than I originally anticipated (actually getting code to run). However, I have gotten some strange results from my model. My data is stored in a pandas DataFrame and is structured as follows:

        ptype         E        px        py        pz  E1E9  ...      E_L4   dEdxCDC   dEdxFDC    tShower    tTrack  thetac
205971    -11  4.617147  4.685261  1.792256  1.387385   NaN  ...  2.057134  0.000003       NaN   2.464077 -0.021874     NaN
50287     130  0.264139       NaN       NaN       NaN   NaN  ...  0.176349       NaN       NaN   7.092736       NaN     NaN
133619  -2212  0.685756 -0.862369 -0.147122  0.232955   NaN  ...  0.438425  0.000003       NaN   4.046603 -0.232221     NaN
269408   -211  0.290424 -1.381033  4.391991  4.357138   NaN  ...  0.128230  0.000002       NaN   3.223814  0.202712     NaN
124688   2212  1.118175 -1.094253  0.306560  4.897157   NaN  ...  0.488274  0.000002  0.000002  10.575593  0.050699     NaN

I then followed some code I found in a Keras tutorial, which I have included below:

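import tensorflow_decision_forests as tfdf

# Convert the pandas DataFrames to TensorFlow datasets, using 'ptype' as the label
trainingData = tfdf.keras.pd_dataframe_to_tf_dataset(trainingDataDF, label='ptype')
testData = tfdf.keras.pd_dataframe_to_tf_dataset(testDataDF, label='ptype')

# Train a gradient boosted trees model and predict on the test set
boostedDecisionTree = tfdf.keras.GradientBoostedTreesModel(max_depth=4, num_trees=50)
boostedDecisionTree.fit(trainingData, verbose=2)
results = boostedDecisionTree.predict(testData)
print(results)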

In case this helps, here is what I get when I print out the variable testData:

<PrefetchDataset element_spec=({'E': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'px': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'py': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'pz': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'E1E9': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'E9E25': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'docaTrack': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'sumU': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'sumV': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'preshowerE': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'sigLong': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'sigTrans': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'sigTheta': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'E_L2': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'E_L3': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'E_L4': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'dEdxCDC': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'dEdxFDC': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'tShower': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'tTrack': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'thetac': TensorSpec(shape=(None,), dtype=tf.float64, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>

What I expect from this is a list of values that occur in the ptype column (they are all integers, and can only take on certain values). However, the output of this code is:

[[3.1166435e-03 3.5639703e-03 5.6484230e-03 ... 6.8573549e-04
  9.3489707e-01 8.1527224e-03]
 [4.2285807e-03 4.7440380e-01 2.0284561e-02 ... 5.7000392e-03
  4.8168894e-02 3.8701162e-02]
 [1.2789826e-03 3.1301938e-03 2.9117553e-03 ... 1.2747306e-02
  3.7976676e-03 2.0397757e-03]
 ...
 [3.2000155e-03 6.0929000e-02 1.4465253e-01 ... 3.7460185e-03
  5.2436376e-01 1.2216719e-01]
 [1.8458949e-02 4.1683179e-01 3.9546650e-02 ... 6.7035188e-03
  9.7159848e-02 8.9666203e-02]
 [4.4867280e-01 8.1635034e-03 6.7420891e-03 ... 2.5453945e-03
  1.0263861e-02 1.5434443e-02]]

which makes absolutely no sense as an output. I have no clue how to interpret this. I'm not even sure how to troubleshoot, so any help would be greatly appreciated. Thank you in advance!

Hi @duberii ,

I think GradientBoostedTreesModel from TensorFlow Decision Forests (TFDF) outputs a probability for each class by default, so each row of the result corresponds to one test example and each column to one class.
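One way to confirm that an array like this holds class probabilities is to check that every entry lies in [0, 1] and that each row sums to roughly 1. The snippet below is only a sketch of that check: `probs` is a made-up stand-in for the array returned by predict(), and `looks_like_probabilities` is a hypothetical helper, not part of TFDF.

```python
import numpy as np

# Hypothetical stand-in for the predict() output: 3 examples, 4 classes.
probs = np.array([
    [0.05, 0.80, 0.10, 0.05],
    [0.60, 0.10, 0.20, 0.10],
    [0.25, 0.25, 0.40, 0.10],
])

def looks_like_probabilities(p, tol=1e-3):
    """Return True if every entry is in [0, 1] and every row sums to about 1."""
    p = np.asarray(p)
    in_range = np.all((p >= 0.0) & (p <= 1.0))
    rows_sum_to_one = np.allclose(p.sum(axis=1), 1.0, atol=tol)
    return bool(in_range and rows_sum_to_one)

print(probs.shape)                      # (number of examples, number of classes)
print(looks_like_probabilities(probs))
```

If the same check passes on the real `results` array, the number of columns tells you how many classes the model is distinguishing.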

I just ran a sample notebook with the same model as above. It outputs probabilities, which is why you are seeing different results than expected. After you get the probabilities, use argmax() to find the index of the most likely class in each row, then map that index back to the corresponding ptype value.
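A rough sketch of that conversion with NumPy is below. The `class_labels` array and the `probs` values are invented for illustration, and `probabilities_to_labels` is a hypothetical helper. The important assumption is that position i in `class_labels` matches column i of the prediction array; how the columns are actually ordered depends on how the labels were encoded, so verify that against your own model before relying on it.

```python
import numpy as np

# Hypothetical ordered class list; position i is assumed to match column i.
class_labels = np.array([-2212, -211, -11, 130, 2212])

# Hypothetical stand-in for the predict() output: 3 examples, 5 classes.
probs = np.array([
    [0.02, 0.03, 0.90, 0.03, 0.02],
    [0.05, 0.10, 0.05, 0.70, 0.10],
    [0.60, 0.10, 0.10, 0.10, 0.10],
])

def probabilities_to_labels(p, labels):
    """Pick the most probable column in each row and look up its label."""
    idx = np.argmax(p, axis=1)  # index of the largest probability per row
    return labels[idx]

predicted = probabilities_to_labels(probs, class_labels)
print(predicted)
```

Note that argmax() by itself only gives column indices (here 2, 3 and 0); the lookup into `class_labels` is what turns them back into ptype values.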

Thanks.
