Hello,
I'm working on a project that uses RandomForestClassifier, and I'd like to train it on data from my CSV file. Unfortunately, my dataset has a lot of text variables, so I decided to convert them to numeric variables. I did that, but when I run the script I get this error:
C:/Users/bitel/PycharmProjects/TensorFlow/venv/RandomForest.py:30: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
forest.fit(X_train,y_train)
Traceback (most recent call last):
File "C:/Users/bitel/PycharmProjects/TensorFlow/venv/RandomForest.py", line 30, in <module>
forest.fit(X_train,y_train)
File "C:\Users\bitel\PycharmProjects\TensorFlow\venv\lib\site-packages\sklearn\ensemble\forest.py", line 276, in fit
y, expanded_class_weight = self._validate_y_class_weight(y)
File "C:\Users\bitel\PycharmProjects\TensorFlow\venv\lib\site-packages\sklearn\ensemble\forest.py", line 476, in _validate_y_class_weight
check_classification_targets(y)
File "C:\Users\bitel\PycharmProjects\TensorFlow\venv\lib\site-packages\sklearn\utils\multiclass.py", line 171, in check_classification_targets
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'continuous'
What do I need to do to fix this error?
Code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import pandas as pd
import numpy as np
chocolate = pd.read_csv(r"C:\choco.csv", delimiter=',')
print(chocolate.head())
#enc = preprocessing.OneHotEncoder
#enc.fit(chocolate[0])
data_random = chocolate.sample(frac=1)
keep_col = ['Company', 'Cocoa Percent', 'Company Location', 'Broad Bean Origin']
rating_col = ['Rating']
data_keep = data_random[keep_col]
ratings = data_random[rating_col]
print(ratings)
list_ratings = np.array(ratings).tolist()
print(list_ratings)
data_dummies = pd.get_dummies(data_keep)
print(data_dummies)
X_train, X_test, y_train, y_test = train_test_split(data_dummies,list_ratings, random_state=0)
forest = RandomForestClassifier(n_estimators=100,random_state=0)
forest.fit(X_train,y_train)
forest.score(X_train,y_train)
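
For anyone who wants to reproduce this without my file, here is a minimal sketch on made-up data. It assumes the Rating column holds fractional values such as 2.75 or 3.5, which is what the 'continuous' label type in the error suggests. The toy DataFrame and its values are hypothetical stand-ins for choco.csv, not the real data. The sketch passes the target as a 1d array (which should also avoid the DataConversionWarning) and then tries two possible ways around the ValueError: switching to RandomForestRegressor, or keeping RandomForestClassifier and turning each rating into a discrete string label.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-in for choco.csv (made-up values, not the real dataset).
rng = np.random.RandomState(0)
n = 200
toy = pd.DataFrame({
    "Company": rng.choice(["A", "B", "C"], size=n),
    "Cocoa Percent": rng.choice([60.0, 70.0, 85.0], size=n),
    "Rating": rng.choice([2.5, 2.75, 3.0, 3.25, 3.5], size=n),
})

# One-hot encode the text column; the numeric column is passed through unchanged.
X = pd.get_dummies(toy[["Company", "Cocoa Percent"]])

# Take the target as a 1d array of shape (n_samples,) instead of a nested list.
y = toy["Rating"].to_numpy()

# scikit-learn's own view of the target: fractional floats count as 'continuous',
# which a classifier refuses to fit.
target_kind = type_of_target(y)
print(target_kind)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Option 1: treat the rating as a number and use a regressor.
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
reg_pred = reg.predict(X_test)

# Option 2: keep the classifier, but make every distinct rating a class label.
y_train_lbl = y_train.astype(str)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train_lbl)
clf_pred = clf.predict(X_test)
```

Which option fits better probably depends on whether the ratings should be treated as ordered numbers (regression) or as a small set of separate categories (classification).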