How can I get better results when training a house price prediction model?

My problem is that I cannot get good results when training a model; whatever I use doesn't work. I tried KNN, Random Forest Regressor, Gradient Boosting Regressor, Linear Regression, and a neural network with Dense layers.

I collected 34,400 rows of house price data. It contains these columns:

price, area, absolute_area, room, floor_count, lat, lng, area_abs_area_difference, area_room_ratio, building_age_new, building_age_very_young, building_age_young, building_age_mid, building_age_old, building_age_very_old

This is my dataset

It goes on for about 34,400 rows.

What I have tried:

I need to create some models using this data. First, I split the dataset into training and validation sets:

train_df, val_df = train_test_split(dataset, test_size=0.15, random_state=42)

x_train = train_df.drop('price', axis=1)
y_train = train_df['price']

x_val = val_df.drop('price', axis=1)
y_val = val_df['price']

I used StandardScaler() to scale the data:

scaler = StandardScaler().fit(x_train)

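import pickle

# Save the fitted scaler so the same scaling can be reused later
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

def preprocessor(X):
    # transform() returns a new array, so no explicit copy is needed
    return scaler.transform(X)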

X_train_preprocessed, X_val_preprocessed = preprocessor(x_train), preprocessor(x_val)

Now, coming to my models:

Linear Regression:

lm = LinearRegression().fit(X_train_selected, y_train)

y_train_pred = model.predict(X_train_selected)
y_val_pred = model.predict(X_val_selected)

train_mse = mse(y_train, y_train_pred, squared=False)
val_mse = mse(y_val, y_val_pred, squared=False)

print("Training MSE:", train_mse)
print("Validation MSE:", val_mse)

The Output:

Training MSE: 32026915.5375083
Validation MSE: 25336006.528607745

KNN:

knn = KNeighborsRegressor(n_neighbors=35).fit(X_train_preprocessed, y_train)

r2_train = knn.score(X_train_preprocessed, y_train)
r2_val = knn.score(X_val_preprocessed, y_val)

r2_train, r2_val

The Output:

(0.11238750123213292, 0.2333151002444518)

I also used Random Forest Regressor and Gradient Boosting Regressor; they gave the same results.

For the last model, I used Dense layers. My project is meant to offer multiple models to pick from, so I need to include a Dense network too.

I created a neural network like this:

medium_nn = Sequential()
medium_nn.add(InputLayer((14,)))
medium_nn.add(Dense(32, 'relu')) # What is ReLU?
medium_nn.add(Dropout(0.1))
medium_nn.add(Dense(16, 'relu'))
medium_nn.add(Dense(1, 'linear'))

opt = Adam(learning_rate=1)
cp = ModelCheckpoint('models/medium_nn', save_best_only=True)
medium_nn.compile(optimizer=opt, loss='mse', metrics=[RootMeanSquaredError()])
medium_nn.fit(x=X_train_preprocessed, y=y_train, validation_data=(X_val_preprocessed, y_val), callbacks=[cp], epochs=100, verbose=0)

y_train_pred_medium_nn = medium_nn.predict(X_train_preprocessed)
y_val_pred_medium_nn = medium_nn.predict(X_val_preprocessed)

medium_nn_r2_train = r2_score(y_train, y_train_pred_medium_nn)
medium_nn_r2_val = r2_score(y_val, y_val_pred_medium_nn)

print("R2 Score - Training Set:", medium_nn_r2_train)
print("R2 Score - Validation Set:", medium_nn_r2_val)

This is the output:

R2 Score - Training Set: 0.28221913486721617
R2 Score - Validation Set: 0.4202027388380072

This is the best I can do. What do you suggest I do? I collected the data myself from a website using Selenium, but I do not know what is keeping my models from learning. Am I lacking data, or am I doing something wrong? Sorry for the long post; I am pretty new to the topic and do not know what to do.

@totames,

Welcome to the TensorFlow Forum!

You can refer to this notebook for house price prediction; it may help you.

If you still have issues, here are some suggestions to improve the model performance:

  1. Identify important features for predicting house prices using correlation analysis, feature importance from tree-based models or dimensionality reduction methods

  2. Handle outliers appropriately by removing, transforming, or using robust regression techniques

  3. Try to use k-fold cross-validation

  4. Optimize model hyperparameters through techniques like grid search or random search

  5. Explore ensemble methods to combine multiple models
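
Suggestions 1, 3 and 4 can be tried together with scikit-learn. The following is only a minimal sketch under stated assumptions: the data is synthetic (a stand-in for the real house price table), and the parameter grid is a small illustrative choice rather than a recommended setting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data (assumption): feature 0 drives most of the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

# 5-fold cross-validation with shuffling, so every row is used for validation once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Small illustrative grid (assumption); a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=cv,
    scoring="r2",
)
search.fit(X, y)

best_params = search.best_params_   # best combination found on the grid
best_r2 = search.best_score_        # mean cross-validated R2 of that combination
# Impurity-based feature importances of the refitted best model (they sum to 1).
importances = search.best_estimator_.feature_importances_

print("Best parameters:", best_params)
print("Mean CV R2:", best_r2)
print("Feature importances:", importances)
```

On the real dataset, X and y would be the feature columns and the price column. The cross-validated score is usually a steadier estimate than a single 85/15 split.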

Please try as suggested above.

Thank you!

Your code contains this:
lm = LinearRegression().fit(X_train_selected, y_train)

y_train_pred = model.predict(X_train_selected)
y_val_pred = model.predict(X_val_selected)

But model is not defined anywhere in your code: you fit lm and then predict with model.
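
A minimal sketch of how that snippet might look once the names agree, assuming the fitted estimator lm was the one meant for prediction. The data here is synthetic (an assumption, standing in for the selected features), and since the original call with squared=False returns the root of the MSE, the values are labelled RMSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): a noisy linear relationship.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=500)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, random_state=42
)

# Fit and predict with the same object.
lm = LinearRegression().fit(X_train, y_train)
y_train_pred = lm.predict(X_train)
y_val_pred = lm.predict(X_val)

# Taking the square root of the MSE gives the RMSE, in the same units as the target.
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

print("Training RMSE:", train_rmse)
print("Validation RMSE:", val_rmse)
```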

Also, your layers have very few neurons.
Try increasing the layer sizes so that the Dense layers have at least 2048 connections to the neighbouring layers.

You could also write an automated tester that tries different network configurations for you.
Lastly, I am new to TensorFlow, but not to neural networks, so do check your layers.