AI Review: Machine Learning Fundamentals
Everything to review for the L3 Computer Science program at UPEC (2026)
Data Manipulation
Dataset Manipulation
In [22]:
%matplotlib inline
from sklearn.datasets import load_iris
iris_dataset = load_iris()
In [23]:
print("Keys of iris_dataset:\n", iris_dataset.keys())
Keys of iris_dataset: dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
In [24]:
print("Type of data:", type(iris_dataset['data']))
Type of data: <class 'numpy.ndarray'>
In [25]:
print(iris_dataset['DESCR'])
.. _iris_dataset:
Iris plants dataset
--------------------
**Data Set Characteristics:**
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:
============== ==== ==== ======= ===== ====================
Min Max Mean SD Class Correlation
============== ==== ==== ======= ===== ====================
sepal length: 4.3 7.9 5.84 0.83 0.7826
sepal width: 2.0 4.4 3.05 0.43 -0.4194
petal length: 1.0 6.9 3.76 1.76 0.9490 (high!)
petal width: 0.1 2.5 1.20 0.76 0.9565 (high!)
============== ==== ==== ======= ===== ====================
:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.
This is perhaps the best known database to be found in the
pattern recognition literature. Fisher's paper is a classic in the field and
is referenced frequently to this day. (See Duda & Hart, for example.) The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant. One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.
.. dropdown:: References
- Fisher, R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons. ISBN 0-471-22361-1. See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments". IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule". IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64. Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...
In [26]:
print("Shape of data:", iris_dataset.data.shape)
Shape of data: (150, 4)
In [27]:
print("Target names:", iris_dataset['target_names'])
print("Feature names:\n", iris_dataset['feature_names'])
Target names: ['setosa' 'versicolor' 'virginica']
Feature names:
 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [28]:
print("data:\n", iris_dataset['data'][0:5])
print("target:\n", iris_dataset['target'])
data:
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
target:
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
NumPy Manipulation
In [29]:
import numpy as np
x = np.array([
[1, 2, 3],
[4, 5, 6]
])
In [30]:
print(x.shape)
(2, 3)
In [31]:
print("max: ", x.max())
print("min: ", x.min())
max:  6
min:  1
In [32]:
# -1 means "infer the remaining dimension (N)" from the total size
# 1xN matrix
print("1xN:\n", x.reshape(1, -1), end="\n\n")
# 2xN matrix
print("2xN:\n", x.reshape(2, -1), end="\n\n")
# 3xN matrix
print("3xN:\n", x.reshape(3, -1), end="\n\n")
# Nx1 matrix
print("Nx1:\n", x.reshape(-1, 1), end="\n\n")
# Nx2 matrix
print("Nx2:\n", x.reshape(-1, 2), end="\n\n")
# Nx3 matrix
print("Nx3:\n", x.reshape(-1, 3), end="\n\n")
1xN:
 [[1 2 3 4 5 6]]

2xN:
 [[1 2 3]
 [4 5 6]]

3xN:
 [[1 2]
 [3 4]
 [5 6]]

Nx1:
 [[1]
 [2]
 [3]
 [4]
 [5]
 [6]]

Nx2:
 [[1 2]
 [3 4]
 [5 6]]

Nx3:
 [[1 2 3]
 [4 5 6]]
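The `-1` trick only works when the other dimension divides the total number of elements; a quick sketch of the failure case:

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])  # 6 elements in total

# 6 elements divide evenly into 3 rows: the inferred dimension is 2
print(x.reshape(3, -1).shape)  # (3, 2)

# 6 elements cannot fill 4 equal rows: NumPy raises ValueError
try:
    x.reshape(4, -1)
except ValueError as e:
    print("ValueError:", e)
```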
In [33]:
print("The values of x at index 1 of each row:", x[:, 1], end="\n\n")  # only works with numpy arrays
print("x[1, :] and x[1] are equivalent:", x[1, :] == x[1], end="\n\n")
The values of x at index 1 of each row: [2 5]

x[1, :] and x[1] are equivalent: [ True  True  True]
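Boolean masks are another indexing tool worth knowing; a small sketch (the `mask` name is illustrative) filtering the iris data by class:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# boolean mask: True where the sample belongs to class 0 (setosa)
mask = (y == 0)
X_setosa = X[mask]  # keeps only the rows where mask is True

print(X_setosa.shape)  # (50, 4): 50 setosa samples, 4 features
print(X_setosa[:, 0].mean())  # mean sepal length of the setosa samples
```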
Classification Models
Nearest Neighbors (k-NN)
In [ ]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# library used only to draw the k-NN decision regions (no need to memorize it!)
from mlxtend.plotting import plot_decision_regions

# this dataset is a classification task: predict which
# label (target) a sample with n attributes belongs to
# - class:
#     - Iris-Setosa
#     - Iris-Versicolour
#     - Iris-Virginica
iris_dataset = load_iris()
X = iris_dataset.data[:, 0:2]
y = iris_dataset.target

# split into 2 sets: one for training (train),
# the other for prediction (test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# this model is only useful for classification
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X_train, y_train)

plot_decision_regions(X_train, y_train, clf=knn_classifier, legend=2)
plt.show()

# predictions for X_test (to compare against y_test)
y_pred_test = knn_classifier.predict(X_test)

# for all of these scores, the best value is 1 and the worst is 0
print("score= ", knn_classifier.score(X_test, y_test))
print("precision score=", precision_score(y_test, y_pred_test, average=None).mean())
print("cross_val_score=", cross_val_score(knn_classifier, X_train, y_train, cv=5, scoring="accuracy").mean())

# comparison between predictions and true values
print("y_pred_test=\t", y_pred_test)
print("y_test= \t", y_test)
print("comparison=\t", y_test == y_pred_test)

# - class:
#     - Iris-Setosa
#     - Iris-Versicolour
#     - Iris-Virginica
# rows: y_test, columns: y_pred_test
# - on the diagonal: correct predictions
# - off the diagonal: classification errors
#
# the rows and columns are the y `classes`
# (Setosa, Versicolour, Virginica)
cm = confusion_matrix(y_test, y_pred_test)
print(cm)

# a nicer rendering of the confusion matrix (matplotlib)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=knn_classifier.classes_).plot()
plt.show()
score=  0.8
precision score= 0.8047138047138046
cross_val_score= 0.7083333333333333
y_pred_test=	 [1 0 2 1 1 0 1 2 1 2 2 0 0 0 0 2 2 1 1 1 0 1 0 1 2 2 1 2 0 0]
y_test= 	 [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
comparison=	 [ True  True  True  True  True  True  True  True  True False  True  True
  True  True  True False  True  True  True False  True False  True False
  True  True False  True  True  True]
[[10  0  0]
 [ 0  7  2]
 [ 0  4  7]]
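The cell above fixes `n_neighbors=3`; a sketch of how `cross_val_score` could compare several candidate values of k (the list of values is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, 0:2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# mean 5-fold accuracy for each candidate k
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X_train, y_train, cv=5, scoring="accuracy").mean()
    print(f"k={k}: {score:.3f}")
```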
Support Vector Classifier (SVC)
In [35]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC
from sklearn.metrics import precision_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# library used only to draw the decision regions (no need to memorize it!)
from mlxtend.plotting import plot_decision_regions

iris_dataset = load_iris()
X = iris_dataset.data[:, 0:2]
y = iris_dataset.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

svc = SVC()
svc.fit(X_train, y_train)

plot_decision_regions(X_train, y_train, clf=svc, legend=2)
plt.show()

# predictions for X_test (to compare against y_test)
y_pred_test = svc.predict(X_test)

# for all of these scores, the best value is 1 and the worst is 0
print("score= ", svc.score(X_test, y_test))
print("precision score=", precision_score(y_test, y_pred_test, average=None).mean())
print("cross_val_score=", cross_val_score(svc, X_train, y_train, cv=5, scoring="accuracy").mean())

# comparison between predictions and true values
print("y_pred_test=\t", y_pred_test)
print("y_test= \t", y_test)
print("comparison=\t", y_test == y_pred_test)

# rows: y_test, columns: y_pred_test
# - on the diagonal: correct predictions
# - off the diagonal: classification errors
#
# the rows and columns are the y `classes`
# (Setosa, Versicolour, Virginica)
cm = confusion_matrix(y_test, y_pred_test)
print(cm)

# a nicer rendering of the confusion matrix (matplotlib)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=svc.classes_).plot()
plt.show()
score=  0.9
precision score= 0.9027777777777778
cross_val_score= 0.7750000000000001
y_pred_test=	 [1 0 2 1 2 0 1 2 1 1 2 0 0 0 0 2 2 1 1 2 0 1 0 2 2 2 2 2 0 0]
y_test= 	 [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
comparison=	 [ True  True  True  True False  True  True  True  True  True  True  True
  True  True  True False  True  True  True  True  True False  True  True
  True  True  True  True  True  True]
[[10  0  0]
 [ 0  7  2]
 [ 0  1 10]]
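`SVC()` is used above with its default hyperparameters; a sketch of how `GridSearchCV` could search for better `C`/`gamma` values (the grid values are illustrative, not exhaustive):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:, 0:2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# candidate hyperparameter values; GridSearchCV tries every combination
# with 5-fold cross-validation and keeps the best one
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("best CV accuracy:", grid.best_score_)
print("test accuracy:", grid.score(X_test, y_test))
```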
Decision Trees (DTC)
In [36]:
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

plot_tree(dtc)

y_pred_test = dtc.predict(X_test)
cm = confusion_matrix(y_test, y_pred_test)

# a nicer rendering of the confusion matrix (matplotlib)
ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=dtc.classes_).plot()
plt.show()
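Beyond the plotted tree, a fitted `DecisionTreeClassifier` exposes `feature_importances_`; a small sketch reading them (with `random_state=42` added for reproducibility, which the cell above does not set):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(X_train, y_train)

# one importance value per feature; the values sum to 1
for name, imp in zip(iris.feature_names, dtc.feature_importances_):
    print(f"{name}: {imp:.3f}")
```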
Preprocessing and Pipelines
Normalization vs Standardization
In [37]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
# library used only to draw the decision regions (no need to memorize it!)
from mlxtend.plotting import plot_decision_regions

iris_dataset = load_iris()

scalers = [
    ("Standardization (StandardScaler)", StandardScaler()),
    ("Normalization (MinMaxScaler)", MinMaxScaler()),
]

X = iris_dataset.data[:, 0:2]
y = iris_dataset.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

for name, scaler in scalers:
    # the same thing can be done with make_pipeline
    # (scaler first, then the model):
    # model = make_pipeline(scaler, SVC())
    scaler.fit(X_train)  # a scaler learns only from the features

    # the rescaled features
    X_train_scaled = scaler.transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = SVC()
    model.fit(X_train_scaled, y_train)

    plot_decision_regions(X_train_scaled, y_train, model, legend=2)
    plt.title(name)
    plt.show()

    # predictions for X_test (to compare against y_test)
    y_pred_test = model.predict(X_test_scaled)

    # for all of these scores, the best value is 1 and the worst is 0
    print(f"score {name}= ", model.score(X_test_scaled, y_test))
    print(f"precision score {name}=", precision_score(y_test, y_pred_test, average=None).mean())
    print(
        f"cross_val_score {name}=",
        cross_val_score(model, X_train_scaled, y_train, cv=5, scoring="accuracy").mean(),
    )

    # comparison between predictions and true values
    print(f"y_pred_test {name}=\t", y_pred_test)
    print(f"y_test {name}= \t", y_test)
    print(f"comparison {name}=\t", y_test == y_pred_test)

    cm = confusion_matrix(y_test, y_pred_test)
    # a nicer rendering of the confusion matrix (matplotlib)
    ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_).plot()
    plt.title(name)
    plt.show()
score Standardization (StandardScaler)=  0.8333333333333334
precision score Standardization (StandardScaler)= 0.8333333333333334
cross_val_score Standardization (StandardScaler)= 0.7666666666666667
y_pred_test Standardization (StandardScaler)=	 [1 0 2 1 2 0 1 2 1 1 2 0 0 0 0 2 2 1 1 1 0 1 0 1 2 2 2 2 0 0]
y_test Standardization (StandardScaler)= 	 [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
comparison Standardization (StandardScaler)=	 [ True  True  True  True False  True  True  True  True  True  True  True
  True  True  True False  True  True  True False  True False  True False
  True  True  True  True  True  True]

score Normalization (MinMaxScaler)=  0.9
precision score Normalization (MinMaxScaler)= 0.9027777777777778
cross_val_score Normalization (MinMaxScaler)= 0.7666666666666667
y_pred_test Normalization (MinMaxScaler)=	 [1 0 2 1 2 0 1 2 1 1 2 0 0 0 0 2 2 1 1 2 0 1 0 2 2 2 2 2 0 0]
y_test Normalization (MinMaxScaler)= 	 [1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]
comparison Normalization (MinMaxScaler)=	 [ True  True  True  True False  True  True  True  True  True  True  True
  True  True  True False  True  True  True  True  True False  True  True
  True  True  True  True  True  True]
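The loop's comment mentions `make_pipeline`; a sketch of that approach, with the steps in execution order (scaler first, then the model), so that cross-validation refits the scaler on each training fold and never leaks test-set statistics into the preprocessing:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:, 0:2], iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() scales X_train then trains the SVC; predict()/score()
# apply the same scaling automatically before predicting
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

print("test score:", model.score(X_test, y_test))
print("CV score:", cross_val_score(model, X_train, y_train, cv=5).mean())
```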
Regression Models
k-NN Regressor
In [38]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
diabetes_dataset = load_diabetes()
X = diabetes_dataset.data
y = diabetes_dataset.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
knn_regressor = KNeighborsRegressor(n_neighbors=3)
knn_regressor.fit(X_train, y_train)
y_pred_test = knn_regressor.predict(X_test)
print(knn_regressor.score(X_test, y_test))
print(mean_squared_error(y_test, y_pred_test))
0.36498737331014663
3364.3932584269664
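The regressor above uses a fixed `n_neighbors=3`; a sketch comparing a few illustrative values of k on the same split (`score` returns R², where 1 is a perfect fit):

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

# R^2 on the test set for each candidate k
for k in [1, 3, 10, 30]:
    reg = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: R^2 = {reg.score(X_test, y_test):.3f}")
```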
K-Means
In [39]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

diabetes_dataset = load_diabetes()
X = diabetes_dataset.data
y = diabetes_dataset.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = KMeans(n_clusters=4)
# KMeans is unsupervised: the y argument is accepted but ignored
model.fit(X_train, y_train)

# - high inertia: the points are very spread out around the
#   centers; the clusters are poorly defined or too large.
# - low inertia: the points are tightly grouped around the
#   centers; the clusters are dense.
# - inertia = 0: happens when there are as many clusters as
#   points (each point is its own center). That is not a good
#   model, it is overfitting.
print("performance: ", model.inertia_)
print("labels: ", model.labels_[0:6])
# coordinates of the cluster centers (10 attributes == 10 dimensions)
print("cluster centers:", model.cluster_centers_)

# Elbow method (pick the value located roughly at the elbow)
inertias = []
ks = np.arange(1, 5)  # avoid shadowing the built-in `range`
for k in ks:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X_train, y_train)
    inertias.append(model.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(ks, inertias, marker='o', linestyle='--')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True)
plt.show()
performance:  4.8866570283713235
labels:  [0 0 3 2 2 2]
cluster centers: [[ 0.01727448  0.0416875   0.02560984  0.02266229  0.02619673  0.02909993 -0.03954555  0.05179512  0.03896623  0.03119948]
 [ 0.01795409  0.012123   -0.02054449 -0.00788777 -0.01157167 -0.01687278  0.03503045 -0.03287191 -0.0246165  -0.00805878]
 [-0.04900123 -0.02285381 -0.04117326 -0.04032911 -0.05017941 -0.04385286  0.00803689 -0.03792245 -0.04425438 -0.04498403]
 [ 0.00579585 -0.04355843  0.02964355  0.01749058  0.0178351   0.01366209  0.00184565  0.00253196  0.01804003  0.01393744]]
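Besides the elbow method, the silhouette score (not used above) gives another way to pick k; a sketch, where values closer to 1 indicate better-separated clusters and values near 0 indicate overlapping ones:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes
from sklearn.metrics import silhouette_score

X = load_diabetes().data

# mean silhouette coefficient over all samples, for several k
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")
```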
Unsupervised Learning
In [40]:
from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes
import matplotlib.pyplot as plt
import numpy as np

# K-Means is a clustering model; clustering is the main
# technique used for unsupervised learning, i.e. learning
# without labels (target)
diabetes_dataset = load_diabetes()
X = diabetes_dataset.data

model = KMeans(n_clusters=4)
model.fit(X)

print("performance: ", model.inertia_)
print("labels: ", model.labels_[0:6])
print("cluster centers:", model.cluster_centers_)

# Elbow method
inertias = []
ks = np.arange(1, 5)  # avoid shadowing the built-in `range`
for k in ks:
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(X)
    inertias.append(model.inertia_)

plt.figure(figsize=(8, 5))
plt.plot(ks, inertias, marker='o', linestyle='--')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.grid(True)
plt.show()
performance:  6.020855278901396
labels:  [2 1 2 3 3 1]
cluster centers: [[ 0.00994588  0.02278107  0.03594701  0.02712021  0.05157302  0.0484309  -0.03702533  0.0653238   0.05370069  0.04216186]
 [-0.03193302 -0.02664382 -0.03352058 -0.03519564 -0.03203236 -0.03408278  0.03095279 -0.04049204 -0.04071605 -0.03279523]
 [ 0.01674244  0.03937076  0.01039053  0.00781913 -0.02372413 -0.01336213 -0.03204393  0.01008546  0.00642312  0.00885634]
 [ 0.01793183 -0.02731041  0.00625965  0.01905516  0.03182917  0.02504278  0.02415163 -0.00763914  0.00667688  0.0018929 ]]
Linear Regression
In [41]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X = diabetes.data[:, np.newaxis, 2]  # keep only the BMI feature, as a column
y = diabetes.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

regr = LinearRegression()
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)

sns.set_theme(style="whitegrid")
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, color="black", alpha=0.5, label="Actual data")

# x coordinates
x_axe = np.linspace(X.min(), X.max(), 100).reshape(-1, 1)
# y coordinates
y_courbe = regr.coef_[0] * x_axe + regr.intercept_

# line fitted by the algorithm
plt.plot(x_axe, y_courbe, color='red', label='Fitted line')

plt.title(
    f"Linear Regression: BMI vs Diabetes Progression\n$R^2$: {r2_score(y_test, y_pred):.2f}",
    fontsize=14,
)
plt.xlabel("Body Mass Index (normalized)")
plt.ylabel("Disease progression")
plt.legend()
plt.show()

print(f"Coefficient (w): {regr.coef_[0]:.2f}")
print(f"Intercept (b): {regr.intercept_:.2f}")
Coefficient (w): 998.58
Intercept (b): 152.00
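The fit above uses BMI alone; for comparison, a sketch fitting `LinearRegression` on all 10 features of the same dataset (R² may differ from the single-feature fit):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=42
)

# all 10 features instead of BMI alone: one coefficient per feature
regr = LinearRegression().fit(X_train, y_train)
print("R^2 (10 features):", r2_score(y_test, regr.predict(X_test)))
print("coefficients:", np.round(regr.coef_, 1))
```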
PCA
In [42]:
from sklearn.decomposition import PCA
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
X = diabetes.data

# PCA reduces the number of features used for learning: it builds
# new components (linear combinations of the original attributes)
# that capture as much of the data's variance as possible
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# the explained variance ratio measures how much of the dataset's
# information the kept components preserve
variance = pca.explained_variance_ratio_.sum()
print(f"Information retained: {variance:.2%}")

# the larger a component's ratio, the more of the data's variance
# it captures; the smaller the ratio, the less it captures
print(pca.explained_variance_ratio_)
Information retained: 55.17%
[0.40242108 0.14923197]
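`n_components` can also be given as a fraction of variance to keep; a sketch of that usage (the 0.90 threshold is illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

X = load_diabetes().data

# with a float, PCA keeps the smallest number of components
# whose cumulative explained variance reaches the threshold
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("variance explained:", pca.explained_variance_ratio_.sum())
```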