My problem is that I cannot get good results when training a model; nothing I try works. I have tried KNN, Random Forest Regressor, Gradient Boosting Regressor, Linear Regression, and a neural network built from Dense layers.
I collected about 34,400 rows of house price data. It contains these columns:
price, area, absolute_area, room, floor_count, lat, lng, area_abs_area_difference, area_room_ratio, building_age_new, building_age_very_young, building_age_young, building_age_mid, building_age_old, building_age_very_old
What I have tried:
I need to build several models on this data. First, I split the dataset into training and validation sets:
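from sklearn.model_selection import train_test_split
# Hold out 15% of the rows for validation
train_df, val_df = train_test_split(dataset, test_size=0.15, random_state=42)
# Separate the features from the target column
x_train = train_df.drop('price', axis=1)
y_train = train_df['price']
x_val = val_df.drop('price', axis=1)
y_val = val_df['price']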
I used StandardScaler() to scale the data:
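import pickle
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training features only
scaler = StandardScaler().fit(x_train)
# Save the fitted scaler so it can be reused at prediction time
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)
def preprocessor(X):
    # Return a scaled copy of X using the statistics learned from x_train
    return scaler.transform(X)
X_train_preprocessed, X_val_preprocessed = preprocessor(x_train), preprocessor(x_val)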
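To make the scaling step concrete, here is a minimal standard-library sketch of the per-column calculation that StandardScaler performs: subtract the training mean and divide by the training (population) standard deviation. The helper names and the `area` values are hypothetical, not taken from my dataset. Note that only the features are scaled; `price` stays in its original units.

```python
from statistics import mean, pstdev

def fit_standardizer(column):
    """Return (mean, population std) of one training column."""
    return mean(column), pstdev(column)

def standardize(column, mu, sigma):
    """Apply z = (x - mu) / sigma to every value in the column."""
    return [(x - mu) / sigma for x in column]

# Hypothetical 'area' values, for illustration only
area_train = [50.0, 100.0, 150.0]
area_val = [100.0, 200.0]

# The statistics come from the training column and are reused for validation
mu, sigma = fit_standardizer(area_train)
area_train_scaled = standardize(area_train, mu, sigma)
area_val_scaled = standardize(area_val, mu, sigma)
```

The scaled training column has mean 0 and standard deviation 1, while the validation column generally does not, because it is transformed with the training statistics.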
Now, coming to my models:
Linear Regression:
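from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error as mse
# X_train_selected / X_val_selected come from a feature-selection step that is not shown in this post
lm = LinearRegression().fit(X_train_selected, y_train)
y_train_pred = lm.predict(X_train_selected)
y_val_pred = lm.predict(X_val_selected)
# squared=False returns the square root of the MSE, so these values are RMSE in price units
train_mse = mse(y_train, y_train_pred, squared=False)
val_mse = mse(y_val, y_val_pred, squared=False)
print("Training MSE:", train_mse)
print("Validation MSE:", val_mse)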
The Output:
Training MSE: 32026915.5375083
Validation MSE: 25336006.528607745
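Because the metric call uses squared=False, the numbers printed as "MSE" are root mean squared errors, expressed in the same units as price. On their own they are hard to judge, so one common reference point is the RMSE of a baseline that always predicts the mean price. The sketch below shows both calculations with only the standard library; the prices are hypothetical and the function name `rmse` is my own.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the average squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices and predictions, for illustration only
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]

model_rmse = rmse(y_true, y_pred)

# Baseline: always predict the mean of the true prices
mean_price = sum(y_true) / len(y_true)
baseline_rmse = rmse(y_true, [mean_price] * len(y_true))

print("Model RMSE:", model_rmse)
print("Baseline RMSE:", baseline_rmse)
```

A model is only useful if its RMSE is clearly below the baseline RMSE computed on the same rows.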
KNN:
knn = KNeighborsRegressor(n_neighbors=35).fit(X_train_preprocessed, y_train)
r2_train = knn.score(X_train_preprocessed, y_train)
r2_val = knn.score(X_val_preprocessed, y_val)
r2_train, r2_val
The Output:
(0.11238750123213292, 0.2333151002444518)
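For reference, the score() method of a scikit-learn regressor returns the coefficient of determination, R2 = 1 - SS_res / SS_tot. A value of 0 means the model does no better than always predicting the mean of the target, and 1 means a perfect fit. The following standard-library sketch computes it by hand on hypothetical numbers; the function name `r2` is my own.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical prices and predictions, for illustration only
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 320.0, 380.0]

good_fit = r2(y_true, y_pred)

# Always predicting the mean gives exactly zero
mean_price = sum(y_true) / len(y_true)
mean_only = r2(y_true, [mean_price] * len(y_true))
```

Read this way, my KNN scores of about 0.11 and 0.23 mean the model explains only a small share of the variance in price.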
I also tried Random Forest Regressor and Gradient Boosting Regressor; they gave the same results.
For the last model, I used Dense layers. My project is meant to train multiple models and pick among them, so a Dense-layer network has to be one of the candidates.
I created a neural network like this:
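# Imports assume the Keras API bundled with TensorFlow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import InputLayer, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.metrics import RootMeanSquaredError
medium_nn = Sequential()
medium_nn.add(InputLayer((14,)))  # 14 input features (every column except price)
medium_nn.add(Dense(32, 'relu'))  # ReLU activation: max(0, x)
medium_nn.add(Dropout(0.1))
medium_nn.add(Dense(16, 'relu'))
medium_nn.add(Dense(1, 'linear'))  # single linear output for the predicted price
opt = Adam(learning_rate=1)  # for comparison, Adam's default learning rate is 0.001
cp = ModelCheckpoint('models/medium_nn', save_best_only=True)
medium_nn.compile(optimizer=opt, loss='mse', metrics=[RootMeanSquaredError()])
medium_nn.fit(x=X_train_preprocessed, y=y_train, validation_data=(X_val_preprocessed, y_val), callbacks=[cp], epochs=100, verbose=0)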
y_train_pred_medium_nn = medium_nn.predict(X_train_preprocessed)
y_val_pred_medium_nn = medium_nn.predict(X_val_preprocessed)
medium_nn_r2_train = r2_score(y_train, y_train_pred_medium_nn)
medium_nn_r2_val = r2_score(y_val, y_val_pred_medium_nn)
print("R2 Score - Training Set:", medium_nn_r2_train)
print("R2 Score - Validation Set:", medium_nn_r2_val)
This is the output:
R2 Score - Training Set: 0.28221913486721617
R2 Score - Validation Set: 0.4202027388380072
This is the best I can do. What do you suggest I do? I am fairly new to this topic. I collected the data myself from a website using Selenium, but I do not know what is preventing my models from learning. Am I lacking data, or am I doing something wrong? Sorry for the long post; I do not know what else to try.