reconnect moved files to git repo

2025-08-01 04:33:03 -04:00
commit 5d3c35492d
23190 changed files with 4750716 additions and 0 deletions
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/__init__.py
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/__init__.py
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/__pycache__/__init__.cpython-311.pyc
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/__pycache__/__init__.cpython-311.pyc
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/breast_cancer.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/breast_cancer.rst
@@ -0,0 +1,118 @@
+.. _breast_cancer_dataset:
+
+Breast cancer Wisconsin (diagnostic) dataset
+--------------------------------------------
+
+**Data Set Characteristics:**
+
+:Number of Instances: 569
+
+:Number of Attributes: 30 numeric, predictive attributes and the class
+
+:Attribute Information:
+    - radius (mean of distances from center to points on the perimeter)
+    - texture (standard deviation of gray-scale values)
+    - perimeter
+    - area
+    - smoothness (local variation in radius lengths)
+    - compactness (perimeter^2 / area - 1.0)
+    - concavity (severity of concave portions of the contour)
+    - concave points (number of concave portions of the contour)
+    - symmetry
+    - fractal dimension ("coastline approximation" - 1)
+
+    The mean, standard error, and "worst" or largest (mean of the three
+    worst/largest values) of these features were computed for each image,
+    resulting in 30 features.  For instance, field 0 is Mean Radius, field
+    10 is Radius SE, field 20 is Worst Radius.
+
+    - class:
+            - WDBC-Malignant
+            - WDBC-Benign
+
+:Summary Statistics:
+
+===================================== ====== ======
+                                        Min    Max
+===================================== ====== ======
+radius (mean):                        6.981  28.11
+texture (mean):                       9.71   39.28
+perimeter (mean):                     43.79  188.5
+area (mean):                          143.5  2501.0
+smoothness (mean):                    0.053  0.163
+compactness (mean):                   0.019  0.345
+concavity (mean):                     0.0    0.427
+concave points (mean):                0.0    0.201
+symmetry (mean):                      0.106  0.304
+fractal dimension (mean):             0.05   0.097
+radius (standard error):              0.112  2.873
+texture (standard error):             0.36   4.885
+perimeter (standard error):           0.757  21.98
+area (standard error):                6.802  542.2
+smoothness (standard error):          0.002  0.031
+compactness (standard error):         0.002  0.135
+concavity (standard error):           0.0    0.396
+concave points (standard error):      0.0    0.053
+symmetry (standard error):            0.008  0.079
+fractal dimension (standard error):   0.001  0.03
+radius (worst):                       7.93   36.04
+texture (worst):                      12.02  49.54
+perimeter (worst):                    50.41  251.2
+area (worst):                         185.2  4254.0
+smoothness (worst):                   0.071  0.223
+compactness (worst):                  0.027  1.058
+concavity (worst):                    0.0    1.252
+concave points (worst):               0.0    0.291
+symmetry (worst):                     0.156  0.664
+fractal dimension (worst):            0.055  0.208
+===================================== ====== ======
+
+:Missing Attribute Values: None
+
+:Class Distribution: 212 - Malignant, 357 - Benign
+
+:Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian
+
+:Donor: Nick Street
+
+:Date: November, 1995
+
+This is a copy of the UCI ML Breast Cancer Wisconsin (Diagnostic) dataset.
+https://goo.gl/U2Uwz2
+
+Features are computed from a digitized image of a fine needle
+aspirate (FNA) of a breast mass.  They describe
+characteristics of the cell nuclei present in the image.
+
+The separating plane described above was obtained using
+Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
+Construction Via Linear Programming." Proceedings of the 4th
+Midwest Artificial Intelligence and Cognitive Science Society,
+pp. 97-101, 1992], a classification method which uses linear
+programming to construct a decision tree.  Relevant features
+were selected using an exhaustive search in the space of 1-4
+features and 1-3 separating planes.
+
+The actual linear program used to obtain the separating plane
+in the 3-dimensional space is that described in:
+[K. P. Bennett and O. L. Mangasarian: "Robust Linear
+Programming Discrimination of Two Linearly Inseparable Sets",
+Optimization Methods and Software 1, 1992, 23-34].
+
+This database is also available through the UW CS ftp server:
+
+ftp ftp.cs.wisc.edu
+cd math-prog/cpo-dataset/machine-learn/WDBC/
+
+.. dropdown:: References
+
+  - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction
+    for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on
+    Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
+    San Jose, CA, 1993.
+  - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and
+    prognosis via linear programming. Operations Research, 43(4), pages 570-577,
+    July-August 1995.
+  - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
+    to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994)
+    163-171.
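The field layout described in breast_cancer.rst (field 0 is Mean Radius, field 10 is Radius SE, field 20 is Worst Radius) can be sketched as follows. This is a minimal illustration, not the loader's code: the helper name `field_name` and the label format are assumptions chosen to match the summary statistics table.

```python
# Ten base measurements, in the order listed under "Attribute Information".
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal dimension",
]
# Three statistics per measurement: fields 0-9, 10-19 and 20-29 respectively.
STATISTICS = ["mean", "standard error", "worst"]


def field_name(index):
    """Return a label such as 'radius (mean)' for a field index in 0..29.

    Hypothetical helper for illustration only.
    """
    total = len(BASE_FEATURES) * len(STATISTICS)
    if not 0 <= index < total:
        raise IndexError(f"field index must be in 0..{total - 1}")
    stat, base = divmod(index, len(BASE_FEATURES))
    return f"{BASE_FEATURES[base]} ({STATISTICS[stat]})"


ALL_FIELDS = [field_name(i) for i in range(30)]
print(ALL_FIELDS[0], "|", ALL_FIELDS[10], "|", ALL_FIELDS[20])
```

The block index (`index // 10`) selects the statistic and the remainder selects the measurement, which is why the same measurement recurs every ten fields.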
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/california_housing.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/california_housing.rst
@@ -0,0 +1,46 @@
+.. _california_housing_dataset:
+
+California Housing dataset
+--------------------------
+
+**Data Set Characteristics:**
+
+:Number of Instances: 20640
+
+:Number of Attributes: 8 numeric, predictive attributes and the target
+
+:Attribute Information:
+    - MedInc        median income in block group
+    - HouseAge      median house age in block group
+    - AveRooms      average number of rooms per household
+    - AveBedrms     average number of bedrooms per household
+    - Population    block group population
+    - AveOccup      average number of household members
+    - Latitude      block group latitude
+    - Longitude     block group longitude
+
+:Missing Attribute Values: None
+
+This dataset was obtained from the StatLib repository.
+https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
+
+The target variable is the median house value for California districts,
+expressed in hundreds of thousands of dollars ($100,000).
+
+This dataset was derived from the 1990 U.S. census, using one row per census
+block group. A block group is the smallest geographical unit for which the U.S.
+Census Bureau publishes sample data (a block group typically has a population
+of 600 to 3,000 people).
+
+A household is a group of people residing within a home. Since the average
+number of rooms and bedrooms in this dataset are provided per household, these
+columns may take surprisingly large values for block groups with few households
+and many empty houses, such as vacation resorts.
+
+It can be downloaded/loaded using the
+:func:`sklearn.datasets.fetch_california_housing` function.
+
+.. rubric:: References
+
+- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
+  Statistics and Probability Letters, 33 (1997) 291-297
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/covtype.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/covtype.rst
@@ -0,0 +1,30 @@
+.. _covtype_dataset:
+
+Forest covertypes
+-----------------
+
+The samples in this dataset correspond to 30×30m patches of forest in the US,
+collected for the task of predicting each patch's cover type,
+i.e. the dominant species of tree.
+There are seven covertypes, making this a multiclass classification problem.
+Each sample has 54 features, described on the
+`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
+Some of the features are boolean indicators,
+while others are discrete or continuous measurements.
+
+**Data Set Characteristics:**
+
+=================   ============
+Classes                        7
+Samples total             581012
+Dimensionality                54
+Features                     int
+=================   ============
+
+:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
+it returns a dictionary-like 'Bunch' object
+with the feature matrix in the ``data`` member
+and the target values in ``target``. If the optional argument ``as_frame`` is
+set to ``True``, it will return ``data`` and ``target`` as pandas
+objects, and there will be an additional member ``frame`` as well.
+The dataset will be downloaded from the web if necessary.
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/diabetes.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/diabetes.rst
@@ -0,0 +1,38 @@
+.. _diabetes_dataset:
+
+Diabetes dataset
+----------------
+
+Ten baseline variables (age, sex, body mass index, average blood
+pressure, and six blood serum measurements) were obtained for each of n =
+442 diabetes patients, as well as the response of interest, a
+quantitative measure of disease progression one year after baseline.
+
+**Data Set Characteristics:**
+
+:Number of Instances: 442
+
+:Number of Attributes: First 10 columns are numeric predictive values
+
+:Target: Column 11 is a quantitative measure of disease progression one year after baseline
+
+:Attribute Information:
+    - age     age in years
+    - sex
+    - bmi     body mass index
+    - bp      average blood pressure
+    - s1      tc, total serum cholesterol
+    - s2      ldl, low-density lipoproteins
+    - s3      hdl, high-density lipoproteins
+    - s4      tch, total cholesterol / HDL
+    - s5      ltg, possibly log of serum triglycerides level
+    - s6      glu, blood sugar level
+
+Note: Each of these 10 feature variables has been mean centered and scaled by the standard deviation times the square root of `n_samples` (i.e. the sum of squares of each column totals 1).
+
+Source URL:
+https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html
+
+For more information see:
+Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
+(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
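The scaling note in diabetes.rst (mean centering, then dividing by the standard deviation times the square root of `n_samples`) can be checked with a short sketch. The function name `center_and_scale` and the sample values are made up for illustration, and the population standard deviation (dividing by `n`) is assumed, since that is the choice that makes each column's sum of squares equal 1.

```python
import math


def center_and_scale(column):
    """Mean-center a column and divide by std * sqrt(n_samples).

    Assumes the population standard deviation (divide by n, not n - 1)
    and a non-constant column, so the scale factor is non-zero.
    """
    n = len(column)
    mean = sum(column) / n
    centered = [x - mean for x in column]
    std = math.sqrt(sum(c * c for c in centered) / n)
    scale = std * math.sqrt(n)
    return [c / scale for c in centered]


ages = [59.0, 48.0, 72.0, 24.0, 50.0]  # made-up values, not from the dataset
scaled = center_and_scale(ages)
print(sum(v * v for v in scaled))  # close to 1, up to floating-point rounding
```

Because `std * sqrt(n)` equals the Euclidean norm of the centered column, the operation amounts to centering followed by unit-norm scaling.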
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/digits.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/digits.rst
@@ -0,0 +1,46 @@
+.. _digits_dataset:
+
+Optical recognition of handwritten digits dataset
+--------------------------------------------------
+
+**Data Set Characteristics:**
+
+:Number of Instances: 1797
+:Number of Attributes: 64
+:Attribute Information: 8x8 image of integer pixels in the range 0..16.
+:Missing Attribute Values: None
+:Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
+:Date: July, 1998
+
+This is a copy of the test set of the UCI ML hand-written digits dataset
+https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits
+
+The data set contains images of hand-written digits: 10 classes where
+each class refers to a digit.
+
+Preprocessing programs made available by NIST were used to extract
+normalized bitmaps of handwritten digits from a preprinted form. From a
+total of 43 people, 30 contributed to the training set and a different 13
+to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
+4x4 and the number of on pixels is counted in each block. This generates
+an input matrix of 8x8 where each element is an integer in the range
+0..16. This reduces dimensionality and gives invariance to small
+distortions.
+
+For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
+T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
+L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
+1994.
+
+.. dropdown:: References
+
+  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
+    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
+    Graduate Studies in Science and Engineering, Bogazici University.
+  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
+  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
+    Linear dimensionality reduction using relevance weighted LDA. School of
+    Electrical and Electronic Engineering Nanyang Technological University.
+    2005.
+  - Claudio Gentile. A New Approximate Maximal Margin Classification
+    Algorithm. NIPS. 2000.
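The preprocessing described in digits.rst (splitting a 32x32 bitmap into nonoverlapping 4x4 blocks and counting the on pixels in each) can be sketched as below. This is not the NIST routine; `downsample_bitmap` is a hypothetical helper and the input bitmap is synthetic.

```python
def downsample_bitmap(bitmap, block=4):
    """Count the on pixels in each nonoverlapping block x block tile.

    `bitmap` is a square list of rows holding 0/1 values. A 32x32 input
    with block=4 yields an 8x8 grid of integers in the range 0..16.
    """
    size = len(bitmap)
    if size % block or any(len(row) != size for row in bitmap):
        raise ValueError("bitmap must be square with a side divisible by block")
    tiles = size // block
    return [
        [
            sum(
                bitmap[r][c]
                for r in range(i * block, (i + 1) * block)
                for c in range(j * block, (j + 1) * block)
            )
            for j in range(tiles)
        ]
        for i in range(tiles)
    ]


# Synthetic 32x32 bitmap whose upper-left quadrant is entirely "on".
bitmap = [[1 if r < 16 and c < 16 else 0 for c in range(32)] for r in range(32)]
features = downsample_bitmap(bitmap)
print(len(features), len(features[0]), features[0][0], features[7][7])
```

Each output cell summarizes 16 input pixels, which is how the step reduces 1024 binary values to 64 small integers while tolerating small shifts inside a block.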
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/iris.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/iris.rst
@@ -0,0 +1,63 @@
+.. _iris_dataset:
+
+Iris plants dataset
+--------------------
+
+**Data Set Characteristics:**
+
+:Number of Instances: 150 (50 in each of three classes)
+:Number of Attributes: 4 numeric, predictive attributes and the class
+:Attribute Information:
+    - sepal length in cm
+    - sepal width in cm
+    - petal length in cm
+    - petal width in cm
+    - class:
+            - Iris-Setosa
+            - Iris-Versicolour
+            - Iris-Virginica
+
+:Summary Statistics:
+
+============== ==== ==== ======= ===== ====================
+                Min  Max   Mean    SD   Class Correlation
+============== ==== ==== ======= ===== ====================
+sepal length:   4.3  7.9   5.84   0.83    0.7826
+sepal width:    2.0  4.4   3.05   0.43   -0.4194
+petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
+petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
+============== ==== ==== ======= ===== ====================
+
+:Missing Attribute Values: None
+:Class Distribution: 33.3% for each of 3 classes.
+:Creator: R.A. Fisher
+:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
+:Date: July, 1988
+
+The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
+from Fisher's paper. Note that it's the same as in R, but not as in the UCI
+Machine Learning Repository, which has two wrong data points.
+
+This is perhaps the best known database to be found in the
+pattern recognition literature.  Fisher's paper is a classic in the field and
+is referenced frequently to this day.  (See Duda & Hart, for example.)  The
+data set contains 3 classes of 50 instances each, where each class refers to a
+type of iris plant.  One class is linearly separable from the other 2; the
+latter are NOT linearly separable from each other.
+
+.. dropdown:: References
+
+  - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
+    Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
+    Mathematical Statistics" (John Wiley, NY, 1950).
+  - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
+    (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
+  - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
+    Structure and Classification Rule for Recognition in Partially Exposed
+    Environments".  IEEE Transactions on Pattern Analysis and Machine
+    Intelligence, Vol. PAMI-2, No. 1, 67-71.
+  - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
+    on Information Theory, May 1972, 431-433.
+  - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al.'s AUTOCLASS II
+    conceptual clustering system finds 3 classes in the data.
+  - Many, many more ...
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/kddcup99.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/kddcup99.rst
@@ -0,0 +1,94 @@
+.. _kddcup99_dataset:
+
+Kddcup 99 dataset
+-----------------
+
+The KDD Cup '99 dataset was created by processing the tcpdump portions
+of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset,
+created by MIT Lincoln Lab [2]_. The artificial data (described on the `dataset's
+homepage <https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html>`_) was
+generated using a closed network and hand-injected attacks to produce a
+large number of different types of attack with normal activity in the
+background. As the initial goal was to produce a large training set for
+supervised learning algorithms, there is a large proportion (80.1%) of
+abnormal data which is unrealistic in the real world, and inappropriate for
+unsupervised anomaly detection which aims at detecting 'abnormal' data, i.e.:
+
+* qualitatively different from normal data
+* in a small minority among the observations.
+
+We thus transform the KDD Data set into two different data sets: SA and SF.
+
+* SA is obtained by simply selecting all the normal data, and a small
+  proportion of abnormal data to give an anomaly proportion of 1%.
+
+* SF is obtained as in [3]_
+  by simply picking up the data whose attribute logged_in is positive, thus
+  focusing on the intrusion attack, which gives a proportion of 0.3% of
+  attack.
+
+* http and smtp are two subsets of SF in which the third feature is
+  equal to 'http' (resp. 'smtp').
+
+General KDD structure:
+
+================      ==========================================
+Samples total         4898431
+Dimensionality        41
+Features              discrete (int) or continuous (float)
+Targets               str, 'normal.' or name of the anomaly type
+================      ==========================================
+
+SA structure:
+
+================      ==========================================
+Samples total         976158
+Dimensionality        41
+Features              discrete (int) or continuous (float)
+Targets               str, 'normal.' or name of the anomaly type
+================      ==========================================
+
+SF structure:
+
+================      ==========================================
+Samples total         699691
+Dimensionality        4
+Features              discrete (int) or continuous (float)
+Targets               str, 'normal.' or name of the anomaly type
+================      ==========================================
+
+http structure:
+
+================      ==========================================
+Samples total         619052
+Dimensionality        3
+Features              discrete (int) or continuous (float)
+Targets               str, 'normal.' or name of the anomaly type
+================      ==========================================
+
+smtp structure:
+
+================      ==========================================
+Samples total         95373
+Dimensionality        3
+Features              discrete (int) or continuous (float)
+Targets               str, 'normal.' or name of the anomaly type
+================      ==========================================
+
+:func:`sklearn.datasets.fetch_kddcup99` will load the kddcup99 dataset; it
+returns a dictionary-like object with the feature matrix in the ``data`` member
+and the target values in ``target``. The "as_frame" optional argument converts
+``data`` into a pandas DataFrame and ``target`` into a pandas Series. The
+dataset will be downloaded from the web if necessary.
+
+.. rubric:: References
+
+.. [2] Analysis and Results of the 1999 DARPA Off-Line Intrusion
+       Detection Evaluation, Richard Lippmann, Joshua W. Haines,
+       David J. Fried, Jonathan Korba, Kumar Das.
+
+.. [3] K. Yamanishi, J.-I. Takeuchi, G. Williams, and P. Milne. Online
+       unsupervised outlier detection using finite mixtures with
+       discounting learning algorithms. In Proceedings of the sixth
+       ACM SIGKDD international conference on Knowledge discovery
+       and data mining, pages 320-324. ACM Press, 2000.
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/lfw.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/lfw.rst
@@ -0,0 +1,124 @@
+.. _labeled_faces_in_the_wild_dataset:
+
+The Labeled Faces in the Wild face recognition dataset
+------------------------------------------------------
+
+This dataset is a collection of JPEG pictures of famous people collected
+over the internet; all details are available on the official website:
+
+http://vis-www.cs.umass.edu/lfw/
+
+Each picture is centered on a single face. The typical task is called
+Face Verification: given a pair of two pictures, a binary classifier
+must predict whether the two images are from the same person.
+
+An alternative task, Face Recognition or Face Identification is:
+given the picture of the face of an unknown person, identify the name
+of the person by referring to a gallery of previously seen pictures of
+identified persons.
+
+Both Face Verification and Face Recognition are tasks that are typically
+performed on the output of a model trained to perform Face Detection. The
+most popular model for Face Detection is called Viola-Jones and is
+implemented in the OpenCV library. The LFW faces were extracted by this
+face detector from various online websites.
+
+**Data Set Characteristics:**
+
+=================   =======================
+Classes                                5749
+Samples total                         13233
+Dimensionality                         5828
+Features            real, between 0 and 255
+=================   =======================
+
+.. dropdown:: Usage
+
+  ``scikit-learn`` provides two loaders that will automatically download,
+  cache, parse the metadata files, decode the JPEG files and convert the
+  interesting slices into memmapped numpy arrays. The dataset size is more
+  than 200 MB. The first load typically takes more than a couple of minutes
+  to fully decode the relevant part of the JPEG files into numpy arrays. Once
+  the dataset has been loaded, subsequent loads take less than 200 ms by
+  using a memmapped version memoized on the disk in the
+  ``~/scikit_learn_data/lfw_home/`` folder using ``joblib``.
+
+  The first loader is used for the Face Identification task: a multi-class
+  classification task (hence supervised learning)::
+
+    >>> from sklearn.datasets import fetch_lfw_people
+    >>> lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
+
+    >>> for name in lfw_people.target_names:
+    ...     print(name)
+    ...
+    Ariel Sharon
+    Colin Powell
+    Donald Rumsfeld
+    George W Bush
+    Gerhard Schroeder
+    Hugo Chavez
+    Tony Blair
+
+  The default slice is a rectangular shape around the face, removing
+  most of the background::
+
+    >>> lfw_people.data.dtype
+    dtype('float32')
+
+    >>> lfw_people.data.shape
+    (1288, 1850)
+
+    >>> lfw_people.images.shape
+    (1288, 50, 37)
+
+  Each of the ``1288`` faces is assigned to a single person id in the ``target``
+  array::
+
+    >>> lfw_people.target.shape
+    (1288,)
+
+    >>> list(lfw_people.target[:10])
+    [5, 6, 3, 1, 0, 1, 3, 4, 3, 0]
+
+  The second loader is typically used for the face verification task: each sample
+  is a pair of two pictures that may or may not belong to the same person::
+
+    >>> from sklearn.datasets import fetch_lfw_pairs
+    >>> lfw_pairs_train = fetch_lfw_pairs(subset='train')
+
+    >>> list(lfw_pairs_train.target_names)
+    ['Different persons', 'Same person']
+
+    >>> lfw_pairs_train.pairs.shape
+    (2200, 2, 62, 47)
+
+    >>> lfw_pairs_train.data.shape
+    (2200, 5828)
+
+    >>> lfw_pairs_train.target.shape
+    (2200,)
+
+  For both the :func:`sklearn.datasets.fetch_lfw_people` and
+  :func:`sklearn.datasets.fetch_lfw_pairs` functions, it is
+  possible to get an additional dimension with the RGB color channels by
+  passing ``color=True``. In that case, the shape of the pairs above will be
+  ``(2200, 2, 62, 47, 3)``.
+
+  The :func:`sklearn.datasets.fetch_lfw_pairs` dataset is subdivided into
+  3 subsets: the development ``train`` set, the development ``test`` set and
+  an evaluation ``10_folds`` set meant to compute performance metrics using a
+  10-fold cross-validation scheme.
+
+.. rubric:: References
+
+* `Labeled Faces in the Wild: A Database for Studying Face Recognition
+  in Unconstrained Environments.
+  <http://vis-www.cs.umass.edu/lfw/lfw.pdf>`_
+  Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller.
+  University of Massachusetts, Amherst, Technical Report 07-49, October, 2007.
+
+
+.. rubric:: Examples
+
+* :ref:`sphx_glr_auto_examples_applications_plot_face_recognition.py`
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/linnerud.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/linnerud.rst
@@ -0,0 +1,24 @@
+.. _linnerrud_dataset:
+
+Linnerud dataset
+----------------
+
+**Data Set Characteristics:**
+
+:Number of Instances: 20
+:Number of Attributes: 3
+:Missing Attribute Values: None
+
+The Linnerud dataset is a multi-output regression dataset. It consists of three
+exercise (data) and three physiological (target) variables collected from
+twenty middle-aged men in a fitness club:
+
+- *physiological* - CSV containing 20 observations on 3 physiological variables:
+   Weight, Waist and Pulse.
+- *exercise* - CSV containing 20 observations on 3 exercise variables:
+   Chins, Situps and Jumps.
+
+.. dropdown:: References
+
+   * Tenenhaus, M. (1998). La regression PLS: theorie et pratique. Paris:
+     Editions Technic.
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/olivetti_faces.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/olivetti_faces.rst
@@ -0,0 +1,44 @@
+.. _olivetti_faces_dataset:
+
+The Olivetti faces dataset
+--------------------------
+
+`This dataset contains a set of face images`_ taken between April 1992 and
+April 1994 at AT&T Laboratories Cambridge. The
+:func:`sklearn.datasets.fetch_olivetti_faces` function is the data
+fetching / caching function that downloads the data
+archive from AT&T.
+
+.. _This dataset contains a set of face images: https://cam-orl.co.uk/facedatabase.html
+
+As described on the original website:
+
+    There are ten different images of each of 40 distinct subjects. For some
+    subjects, the images were taken at different times, varying the lighting,
+    facial expressions (open / closed eyes, smiling / not smiling) and facial
+    details (glasses / no glasses). All the images were taken against a dark
+    homogeneous background with the subjects in an upright, frontal position
+    (with tolerance for some side movement).
+
+**Data Set Characteristics:**
+
+=================   =====================
+Classes                                40
+Samples total                         400
+Dimensionality                       4096
+Features            real, between 0 and 1
+=================   =====================
+
+The image is quantized to 256 grey levels and stored as unsigned 8-bit
+integers; the loader will convert these to floating point values on the
+interval [0, 1], which are easier to work with for many algorithms.
+
+The "target" for this database is an integer from 0 to 39 indicating the
+identity of the person pictured; however, with only 10 examples per class, this
+relatively small dataset is more interesting from an unsupervised or
+semi-supervised perspective.
+
+The original dataset consisted of 92 x 112 images, while the version available here
+consists of 64x64 images.
+
+When using these images, please give credit to AT&T Laboratories Cambridge.
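The conversion mentioned in olivetti_faces.rst (unsigned 8-bit grey levels mapped to floating point values on the interval [0, 1]) can be illustrated with a minimal sketch. Dividing by 255 is an assumption made here for simplicity; the loader's exact scaling may differ, and `to_unit_interval` is a hypothetical helper.

```python
def to_unit_interval(pixels):
    """Map 8-bit grey levels (0..255) to floats in [0, 1].

    Assumes plain division by 255; this is one straightforward mapping,
    not necessarily the one used by the loader.
    """
    return [[value / 255.0 for value in row] for row in pixels]


tiny_image = [[0, 64], [128, 255]]  # made-up 2x2 image
scaled = to_unit_interval(tiny_image)
print(scaled)
```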
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/rcv1.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/rcv1.rst
@@ -0,0 +1,72 @@
+.. _rcv1_dataset:
+
+RCV1 dataset
+------------
+
+Reuters Corpus Volume I (RCV1) is an archive of over 800,000 manually
+categorized newswire stories made available by Reuters, Ltd. for research
+purposes. The dataset is extensively described in [1]_.
+
+**Data Set Characteristics:**
+
+==============     =====================
+Classes                              103
+Samples total                     804414
+Dimensionality                     47236
+Features           real, between 0 and 1
+==============     =====================
+
+:func:`sklearn.datasets.fetch_rcv1` will load the following
+version: RCV1-v2, vectors, full sets, topics multilabels::
+
+    >>> from sklearn.datasets import fetch_rcv1
+    >>> rcv1 = fetch_rcv1()
+
+It returns a dictionary-like object, with the following attributes:
+
+``data``:
+The feature matrix is a scipy CSR sparse matrix, with 804414 samples and
+47236 features. Non-zero values contain cosine-normalized, log TF-IDF vectors.
+A nearly chronological split is proposed in [1]_: The first 23149 samples are
+the training set. The last 781265 samples are the testing set. This follows
+the official LYRL2004 chronological split. The array has 0.16% of non zero
+values::
+
+    >>> rcv1.data.shape
+    (804414, 47236)
+
+``target``:
+The target values are stored in a scipy CSR sparse matrix, with 804414 samples
+and 103 categories. Each sample has a value of 1 in its categories, and 0 in
+others. The array has 3.15% of non zero values::
+
+    >>> rcv1.target.shape
+    (804414, 103)
+
+``sample_id``:
+Each sample can be identified by its ID, ranging (with gaps) from 2286
+to 810596::
+
+    >>> rcv1.sample_id[:3]
+    array([2286, 2287, 2288], dtype=uint32)
+
+``target_names``:
+The target values are the topics of each sample. Each sample belongs to at
+least one topic, and to up to 17 topics. There are 103 topics, each
+represented by a string. Their corpus frequencies span five orders of
+magnitude, from 5 occurrences for 'GMIL', to 381327 for 'CCAT'::
+
+    >>> rcv1.target_names[:3].tolist()  # doctest: +SKIP
+    ['E11', 'ECAT', 'M11']
+
+The dataset will be downloaded from the `rcv1 homepage`_ if necessary.
+The compressed size is about 656 MB.
+
+.. _rcv1 homepage: http://jmlr.csail.mit.edu/papers/volume5/lewis04a/
+
+
+.. rubric:: References
+
+.. [1] Lewis, D. D., Yang, Y., Rose, T. G., & Li, F. (2004).
+       RCV1: A new benchmark collection for text categorization research.
+       The Journal of Machine Learning Research, 5, 361-397.
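The chronological split described for ``data`` in rcv1.rst (first 23149 samples for training, last 781265 for testing) can be expressed as two row slices. This is a sketch using only the counts stated in the description; `lyrl2004_split` is a hypothetical helper name.

```python
# Counts taken from the dataset description.
N_SAMPLES = 804414
N_TRAIN = 23149
N_TEST = 781265


def lyrl2004_split(n_samples=N_SAMPLES, n_train=N_TRAIN):
    """Return (train_rows, test_rows) as slices over the sample axis."""
    if not 0 <= n_train <= n_samples:
        raise ValueError("n_train must lie between 0 and n_samples")
    return slice(0, n_train), slice(n_train, n_samples)


train_rows, test_rows = lyrl2004_split()
print(train_rows, test_rows)
```

Assuming `rcv1` is the object returned by `fetch_rcv1()`, the slices could then be applied to the rows of both matrices, for example `rcv1.data[train_rows]` and `rcv1.target[train_rows]`.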
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/species_distributions.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/species_distributions.rst
@@ -0,0 +1,36 @@
+.. _species_distribution_dataset:
+
+Species distribution dataset
+----------------------------
+
+This dataset represents the geographic distribution of two species in Central and
+South America. The two species are:
+
+- `"Bradypus variegatus" <http://www.iucnredlist.org/details/3038/0>`_ ,
+  the Brown-throated Sloth.
+
+- `"Microryzomys minutus" <http://www.iucnredlist.org/details/13408/0>`_ ,
+  also known as the Forest Small Rice Rat, a rodent that lives in
+  Colombia, Ecuador, Peru, and Venezuela.
+
+The dataset is not a typical dataset since a :class:`~sklearn.datasets.base.Bunch`
+containing the attributes `data` and `target` is not returned. Instead, we have
+information that allows creating a "density" map of the different species.
+
+The grid for the map can be built using the attributes `x_left_lower_corner`,
+`y_left_lower_corner`, `Nx`, `Ny` and `grid_size`, which respectively correspond
+to the x and y coordinates of the lower left corner of the grid, the number of
+points along the x- and y-axis and the size of the step on the grid.
+
+The density at each location of the grid is contained in the `coverage` attribute.
+
+Finally, the `train` and `test` attributes contain information regarding the
+presence of a species at a specific location.
+
+The dataset is provided by Phillips et al. (2006).
+
+.. rubric:: References
+
+* `"Maximum entropy modeling of species geographic distributions"
+  <http://rob.schapire.net/papers/ecolmod.pdf>`_ S. J. Phillips,
+  R. P. Anderson, R. E. Schapire - Ecological Modelling, 190:231-259, 2006.
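The grid construction described in species_distributions.rst (from `x_left_lower_corner`, `y_left_lower_corner`, `Nx`, `Ny` and `grid_size`) can be sketched as follows. The helper `build_grid` and the numeric values are made up for illustration, and treating each coordinate as the lower-left corner of a cell is an assumption; an offset by one step or half a step may be appropriate depending on the convention used.

```python
def build_grid(x_left_lower_corner, y_left_lower_corner, Nx, Ny, grid_size):
    """Return the x and y coordinates of the grid points.

    Assumes point (i, j) sits at the lower-left corner plus i (resp. j)
    steps of `grid_size` along the x (resp. y) axis.
    """
    xs = [x_left_lower_corner + i * grid_size for i in range(Nx)]
    ys = [y_left_lower_corner + j * grid_size for j in range(Ny)]
    return xs, ys


# Illustrative values only; not the attributes of the actual dataset.
xs, ys = build_grid(-10.0, 5.0, Nx=4, Ny=3, grid_size=0.5)
print(xs)
print(ys)
```

Under the same assumption, a value from the `coverage` attribute at grid position `(j, i)` would correspond to the location `(xs[i], ys[j])`.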
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/twenty_newsgroups.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/twenty_newsgroups.rst
@ -0,0 +1,248 @@
+.. _20newsgroups_dataset:
+
+The 20 newsgroups text dataset
+------------------------------
+
+The 20 newsgroups dataset comprises around 18000 newsgroups posts on
+20 topics split in two subsets: one for training (or development)
+and the other one for testing (or for performance evaluation). The split
+between the train and test set is based upon messages posted before
+and after a specific date.
+
+This module contains two loaders. The first one,
+:func:`sklearn.datasets.fetch_20newsgroups`,
+returns a list of the raw texts that can be fed to text feature
+extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
+with custom parameters so as to extract feature vectors.
+The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
+returns ready-to-use features, i.e., it is not necessary to use a feature
+extractor.
+
+**Data Set Characteristics:**
+
+=================   ==========
+Classes                     20
+Samples total            18846
+Dimensionality               1
+Features                  text
+=================   ==========
+
+.. dropdown:: Usage
+
+  The :func:`sklearn.datasets.fetch_20newsgroups` function is a data
+  fetching / caching function that downloads the data archive from
+  the original `20 newsgroups website <http://people.csail.mit.edu/jrennie/20Newsgroups/>`__,
+  extracts the archive contents
+  in the ``~/scikit_learn_data/20news_home`` folder and calls the
+  :func:`sklearn.datasets.load_files` on either the training or
+  testing set folder, or both of them::
+
+    >>> from sklearn.datasets import fetch_20newsgroups
+    >>> newsgroups_train = fetch_20newsgroups(subset='train')
+
+    >>> from pprint import pprint
+    >>> pprint(list(newsgroups_train.target_names))
+    ['alt.atheism',
+     'comp.graphics',
+     'comp.os.ms-windows.misc',
+     'comp.sys.ibm.pc.hardware',
+     'comp.sys.mac.hardware',
+     'comp.windows.x',
+     'misc.forsale',
+     'rec.autos',
+     'rec.motorcycles',
+     'rec.sport.baseball',
+     'rec.sport.hockey',
+     'sci.crypt',
+     'sci.electronics',
+     'sci.med',
+     'sci.space',
+     'soc.religion.christian',
+     'talk.politics.guns',
+     'talk.politics.mideast',
+     'talk.politics.misc',
+     'talk.religion.misc']
+
+  The real data lies in the ``filenames`` and ``target`` attributes. The target
+  attribute is the integer index of the category::
+
+    >>> newsgroups_train.filenames.shape
+    (11314,)
+    >>> newsgroups_train.target.shape
+    (11314,)
+    >>> newsgroups_train.target[:10]
+    array([ 7,  4,  4,  1, 14, 16, 13,  3,  2,  4])
+
+  It is possible to load only a sub-selection of the categories by passing the
+  list of the categories to load to the
+  :func:`sklearn.datasets.fetch_20newsgroups` function::
+
+    >>> cats = ['alt.atheism', 'sci.space']
+    >>> newsgroups_train = fetch_20newsgroups(subset='train', categories=cats)
+
+    >>> list(newsgroups_train.target_names)
+    ['alt.atheism', 'sci.space']
+    >>> newsgroups_train.filenames.shape
+    (1073,)
+    >>> newsgroups_train.target.shape
+    (1073,)
+    >>> newsgroups_train.target[:10]
+    array([0, 1, 1, 1, 0, 1, 1, 0, 0, 0])
+
+.. dropdown:: Converting text to vectors
+
+  In order to feed predictive or clustering models with the text data,
+  one first needs to turn the text into vectors of numerical values suitable
+  for statistical analysis. This can be achieved with the utilities of the
+  ``sklearn.feature_extraction.text`` as demonstrated in the following
+  example that extracts `TF-IDF <https://en.wikipedia.org/wiki/Tf-idf>`__ vectors
+  of unigram tokens from a subset of 20news::
+
+    >>> from sklearn.feature_extraction.text import TfidfVectorizer
+    >>> categories = ['alt.atheism', 'talk.religion.misc',
+    ...               'comp.graphics', 'sci.space']
+    >>> newsgroups_train = fetch_20newsgroups(subset='train',
+    ...                                       categories=categories)
+    >>> vectorizer = TfidfVectorizer()
+    >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
+    >>> vectors.shape
+    (2034, 34118)
+
+  The extracted TF-IDF vectors are very sparse, with an average of 159 non-zero
+  components per sample in a more than 30000-dimensional space
+  (less than .5% non-zero features)::
+
+    >>> vectors.nnz / float(vectors.shape[0])
+    159.01327...
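+
+  The "less than .5%" figure follows from dividing the average number of
+  non-zero components by the number of features. A minimal sketch of that
+  arithmetic, reusing the numbers printed above rather than recomputing them
+  from the data (``density`` is a hypothetical helper name):
+

```python
def density(avg_nonzero_per_sample, n_features):
    """Fraction of non-zero entries in an average row."""
    return avg_nonzero_per_sample / n_features


# Numbers copied from the outputs shown above.
n_features = 34118
avg_nonzero = 159.01327

fraction = density(avg_nonzero, n_features)
print(f"{100 * fraction:.2f}% non-zero")
```

+
+  With the actual sparse matrix, the same quantity would be
+  ``vectors.nnz / (vectors.shape[0] * vectors.shape[1])``.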
+
+  :func:`sklearn.datasets.fetch_20newsgroups_vectorized` is a function which
+  returns ready-to-use token count features instead of file names.
+
+.. dropdown:: Filtering text for more realistic training
+
+  It is easy for a classifier to overfit on particular things that appear in the
+  20 Newsgroups data, such as newsgroup headers. Many classifiers achieve very
+  high F-scores, but their results would not generalize to other documents that
+  aren't from this window of time.
+
+  For example, let's look at the results of a multinomial Naive Bayes classifier,
+  which is fast to train and achieves a decent F-score::
+
+    >>> from sklearn.naive_bayes import MultinomialNB
+    >>> from sklearn import metrics
+    >>> newsgroups_test = fetch_20newsgroups(subset='test',
+    ...                                      categories=categories)
+    >>> vectors_test = vectorizer.transform(newsgroups_test.data)
+    >>> clf = MultinomialNB(alpha=.01)
+    >>> clf.fit(vectors, newsgroups_train.target)
+    MultinomialNB(alpha=0.01)
+
+    >>> pred = clf.predict(vectors_test)
+    >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
+    0.88213...
+
+  (The example :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py` shuffles
+  the training and test data, instead of segmenting by time, and in that case
+  multinomial Naive Bayes gets a much higher F-score of 0.88. Are you suspicious
+  yet of what's going on inside this classifier?)
+
+  Let's take a look at what the most informative features are:
+
+    >>> import numpy as np
+    >>> def show_top10(classifier, vectorizer, categories):
+    ...     feature_names = vectorizer.get_feature_names_out()
+    ...     for i, category in enumerate(categories):
+    ...         top10 = np.argsort(classifier.feature_log_prob_[i])[-10:]
+    ...         print("%s: %s" % (category, " ".join(feature_names[top10])))
+    ...
+    >>> show_top10(clf, vectorizer, newsgroups_train.target_names)
+    alt.atheism: edu it and in you that is of to the
+    comp.graphics: edu in graphics it is for and of to the
+    sci.space: edu it that is in and space to of the
+    talk.religion.misc: not it you in is that and to of the
+
+
+  You can now see many things that these features have overfit to:
+
+  - Almost every group is distinguished by whether headers such as
+    ``NNTP-Posting-Host:`` and ``Distribution:`` appear more or less often.
+  - Another significant feature involves whether the sender is affiliated with
+    a university, as indicated either by their headers or their signature.
+  - The word "article" is a significant feature, based on how often people quote
+    previous posts like this: "In article [article ID], [name] <[e-mail address]>
+    wrote:"
+  - Other features match the names and e-mail addresses of particular people who
+    were posting at the time.
+
+  With such an abundance of clues that distinguish newsgroups, the classifiers
+  barely have to identify topics from text at all, and they all perform at the
+  same high level.
+
+  For this reason, the functions that load 20 Newsgroups data provide a
+  parameter called **remove**, which tells them what kinds of information to
+  strip out of each file. **remove** should be a tuple containing any subset of
+  ``('headers', 'footers', 'quotes')``, to remove headers, signature
+  blocks, and quotation blocks respectively.
+
+    >>> newsgroups_test = fetch_20newsgroups(subset='test',
+    ...                                      remove=('headers', 'footers', 'quotes'),
+    ...                                      categories=categories)
+    >>> vectors_test = vectorizer.transform(newsgroups_test.data)
+    >>> pred = clf.predict(vectors_test)
+    >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
+    0.77310...
+
+  This classifier lost a lot of its F-score, just because we removed
+  metadata that has little to do with topic classification.
+  It loses even more if we also strip this metadata from the training data:
+
+    >>> newsgroups_train = fetch_20newsgroups(subset='train',
+    ...                                       remove=('headers', 'footers', 'quotes'),
+    ...                                       categories=categories)
+    >>> vectors = vectorizer.fit_transform(newsgroups_train.data)
+    >>> clf = MultinomialNB(alpha=.01)
+    >>> clf.fit(vectors, newsgroups_train.target)
+    MultinomialNB(alpha=0.01)
+
+    >>> vectors_test = vectorizer.transform(newsgroups_test.data)
+    >>> pred = clf.predict(vectors_test)
+    >>> metrics.f1_score(newsgroups_test.target, pred, average='macro')
+    0.76995...
+
+  Some other classifiers cope better with this harder version of the task. Try the
+  :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`
+  example with and without the `remove` option to compare the results.
+
+.. topic:: Data Considerations
+
+  The Cleveland Indians is a major league baseball team based in Cleveland,
+  Ohio, USA. In December 2020, it was reported that "After several months of
+  discussion sparked by the death of George Floyd and a national reckoning over
+  race and colonialism, the Cleveland Indians have decided to change their
+  name." Team owner Paul Dolan "did make it clear that the team will not make
+  its informal nickname -- the Tribe -- its new team name." "It's not going to
+  be a half-step away from the Indians," Dolan said. "We will not have a Native
+  American-themed name."
+
+  https://www.mlb.com/news/cleveland-indians-team-name-change
+
+.. topic:: Recommendation
+
+  - When evaluating text classifiers on the 20 Newsgroups data, you
+    should strip newsgroup-related metadata. In scikit-learn, you can do this
+    by setting ``remove=('headers', 'footers', 'quotes')``. The F-score will be
+    lower because it is more realistic.
+  - This text dataset contains data which may be inappropriate for certain NLP
+    applications. An example is listed in the "Data Considerations" section
+    above. The challenge with using current text datasets in NLP for tasks such
+    as sentence completion, clustering, and other applications is that text
+    that is culturally biased and inflammatory will propagate biases. This
+    should be taken into consideration when using the dataset and reviewing the
+    output, and the bias should be documented.
+
+.. rubric:: Examples
+
+* :ref:`sphx_glr_auto_examples_model_selection_plot_grid_search_text_feature_extraction.py`
+* :ref:`sphx_glr_auto_examples_text_plot_document_classification_20newsgroups.py`
+* :ref:`sphx_glr_auto_examples_text_plot_hashing_vs_dict_vectorizer.py`
+* :ref:`sphx_glr_auto_examples_text_plot_document_clustering.py`
--- a/venv/lib/python3.11/site-packages/sklearn/datasets/descr/wine_data.rst
+++ b/venv/lib/python3.11/site-packages/sklearn/datasets/descr/wine_data.rst
@ -0,0 +1,94 @@
+.. _wine_dataset:
+
+Wine recognition dataset
+------------------------
+
+**Data Set Characteristics:**
+
+:Number of Instances: 178
+:Number of Attributes: 13 numeric, predictive attributes and the class
+:Attribute Information:
+    - Alcohol
+    - Malic acid
+    - Ash
+    - Alcalinity of ash
+    - Magnesium
+    - Total phenols
+    - Flavanoids
+    - Nonflavanoid phenols
+    - Proanthocyanins
+    - Color intensity
+    - Hue
+    - OD280/OD315 of diluted wines
+    - Proline
+    - class:
+        - class_0
+        - class_1
+        - class_2
+
+:Summary Statistics:
+
+============================= ==== ===== ======= =====
+                                Min   Max   Mean     SD
+============================= ==== ===== ======= =====
+Alcohol:                      11.0  14.8    13.0   0.8
+Malic Acid:                   0.74  5.80    2.34  1.12
+Ash:                          1.36  3.23    2.36  0.27
+Alcalinity of Ash:            10.6  30.0    19.5   3.3
+Magnesium:                    70.0 162.0    99.7  14.3
+Total Phenols:                0.98  3.88    2.29  0.63
+Flavanoids:                   0.34  5.08    2.03  1.00
+Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
+Proanthocyanins:              0.41  3.58    1.59  0.57
+Colour Intensity:              1.3  13.0     5.1   2.3
+Hue:                          0.48  1.71    0.96  0.23
+OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
+Proline:                       278  1680     746   315
+============================= ==== ===== ======= =====
+
+:Missing Attribute Values: None
+:Class Distribution: class_0 (59), class_1 (71), class_2 (48)
+:Creator: R.A. Fisher
+:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
+:Date: July, 1988
+
+This is a copy of the UCI ML Wine recognition dataset.
+https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
+
+The data are the results of a chemical analysis of wines grown in the same
+region in Italy by three different cultivators. There are thirteen different
+measurements taken for different constituents found in the three types of
+wine.
+
+Original Owners:
+
+Forina, M. et al., PARVUS -
+An Extendible Package for Data Exploration, Classification and Correlation.
+Institute of Pharmaceutical and Food Analysis and Technologies,
+Via Brigata Salerno, 16147 Genoa, Italy.
+
+Citation:
+
+Lichman, M. (2013). UCI Machine Learning Repository
+[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
+School of Information and Computer Science.
+
+.. dropdown:: References
+
+    (1) S. Aeberhard, D. Coomans and O. de Vel,
+    Comparison of Classifiers in High Dimensional Settings,
+    Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
+    Mathematics and Statistics, James Cook University of North Queensland.
+    (Also submitted to Technometrics).
+
+    The data was used with many others for comparing various
+    classifiers. The classes are separable, though only RDA
+    has achieved 100% correct classification.
+    (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
+    (All results using the leave-one-out technique)
+
+    (2) S. Aeberhard, D. Coomans and O. de Vel,
+    "THE CLASSIFICATION PERFORMANCE OF RDA"
+    Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
+    Mathematics and Statistics, James Cook University of North Queensland.
+    (Also submitted to Journal of Chemometrics).