{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "mP9aApBqQJed" }, "source": [ "Building preprocessing routines with Scikit-Learn\n", "=================================================" ] }, { "cell_type": "markdown", "metadata": { "id": "2V8sUvroQJej" }, "source": [ "## Introduction\n", "\n", "Data preprocessing tasks can be long and tedious. However, that is not the only problem. If they are not done correctly, they can often introduce modeling problems that are hard to detect and that invalidate any machine learning technique applied afterwards.\n", "\n", "In this section we will learn several ways to perform preprocessing and, finally, a way to package these transformations so that they are reproducible later." ] }, { "cell_type": "markdown", "metadata": { "id": "tRmZrmz7QJel" }, "source": [ "### Installation" ] }, { "cell_type": "markdown", "metadata": { "id": "3Yj6mHMaQJel" }, "source": [ "We will need to install the following libraries:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "IeIXxuSaQJem" }, "outputs": [], "source": [ "!pip install ydata_profiling cloudpickle scikit-learn pandas numpy --quiet" ] }, { "cell_type": "markdown", "metadata": { "id": "tTLrI09WQJeo" }, "source": [ "### About the UCI census dataset\n", "\n", "The UCI census dataset is a dataset in which each record represents a person. Each record contains 14 columns that describe a single person, taken from the 1994 United States census database. This includes information such as age, marital status, and education level. The task is to determine whether a person has a high income (defined as earning more than $50 thousand a year). 
This task is often used in fairness research, partly because the dataset's attributes are easy to understand, including some sensitive ones such as age and gender, and partly because it is clearly a real-world task." ] }, { "cell_type": "markdown", "metadata": { "id": "vWyomYxTQJep" }, "source": [ "We download the dataset" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "ovck-xIsQJeq" }, "outputs": [], "source": [ "!wget https://santiagxf.blob.core.windows.net/public/datasets/uci_census.zip \\\n", " --quiet --no-clobber\n", "!mkdir -p datasets/uci_census\n", "!unzip -qq uci_census.zip -d datasets/uci_census" ] }, { "cell_type": "markdown", "metadata": { "id": "EbVjMnWWQJeq" }, "source": [ "We load it" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "K3qSdhVHQJeq" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "train = pd.read_csv('datasets/uci_census/data/adult-train.csv')\n", "test = pd.read_csv('datasets/uci_census/data/adult-test.csv')" ] }, { "cell_type": "code", "source": [ "train" ], "metadata": { "id": "6GyGTqzJzaaY", "outputId": "e6733a70-c43a-4d66-e1cf-7f8740291010", "colab": { "base_uri": "https://localhost:8080/", "height": 562 } }, "execution_count": 6, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ " income age workclass fnlwgt education education-num \\\n", "0 <=50K 39 State-gov 77516 Bachelors 13 \n", "1 <=50K 50 Self-emp-not-inc 83311 Bachelors 13 \n", "2 <=50K 38 Private 215646 HS-grad 9 \n", "3 <=50K 53 Private 234721 11th 7 \n", "4 <=50K 28 Private 338409 Bachelors 13 \n", "... ... ... ... ... ... ... 
\n", "32556 <=50K 27 Private 257302 Assoc-acdm 12 \n", "32557 >50K 40 Private 154374 HS-grad 9 \n", "32558 <=50K 58 Private 151910 HS-grad 9 \n", "32559 <=50K 22 Private 201490 HS-grad 9 \n", "32560 >50K 52 Self-emp-inc 287927 HS-grad 9 \n", "\n", " marital-status occupation relationship race \\\n", "0 Never-married Adm-clerical Not-in-family White \n", "1 Married-civ-spouse Exec-managerial Husband White \n", "2 Divorced Handlers-cleaners Not-in-family White \n", "3 Married-civ-spouse Handlers-cleaners Husband Black \n", "4 Married-civ-spouse Prof-specialty Wife Black \n", "... ... ... ... ... \n", "32556 Married-civ-spouse Tech-support Wife White \n", "32557 Married-civ-spouse Machine-op-inspct Husband White \n", "32558 Widowed Adm-clerical Unmarried White \n", "32559 Never-married Adm-clerical Own-child White \n", "32560 Married-civ-spouse Exec-managerial Wife White \n", "\n", " gender capital-gain capital-loss hours-per-week native-country \n", "0 Male 2174 0 40 United-States \n", "1 Male 0 0 13 United-States \n", "2 Male 0 0 40 United-States \n", "3 Male 0 0 40 United-States \n", "4 Female 0 0 40 Cuba \n", "... ... ... ... ... ... \n", "32556 Female 0 0 38 United-States \n", "32557 Male 0 0 40 United-States \n", "32558 Female 0 0 40 United-States \n", "32559 Male 0 0 20 United-States \n", "32560 Female 15024 0 40 United-States \n", "\n", "[32561 rows x 15 columns]" ], "text/html": [ "\n", "
| | income | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | gender | capital-gain | capital-loss | hours-per-week | native-country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <=50K | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States |
| 1 | <=50K | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States |
| 2 | <=50K | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States |
| 3 | <=50K | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States |
| 4 | <=50K | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | <=50K | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States |
| 32557 | >50K | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States |
| 32558 | <=50K | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States |
| 32559 | <=50K | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States |
| 32560 | >50K | 52 | Self-emp-inc | 287927 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 15024 | 0 | 40 | United-States |
32561 rows × 15 columns
\n", "
\n", "
| | income | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | ... | native-country_Portugal | native-country_Puerto-Rico | native-country_Scotland | native-country_South | native-country_Taiwan | native-country_Thailand | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | <=50K | 0.030390 | -1.063569 | 1.134777 | 2.830199 | -0.22116 | -0.035664 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | <=50K | 0.836973 | -1.008668 | 1.134777 | -0.299391 | -0.22116 | -2.222483 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | <=50K | -0.042936 | 0.245040 | -0.420679 | -0.299391 | -0.22116 | -0.035664 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | <=50K | 1.056950 | 0.425752 | -1.198407 | -0.299391 | -0.22116 | -0.035664 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | <=50K | -0.776193 | 1.408066 | 1.134777 | -0.299391 | -0.22116 | -0.035664 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 101 columns
\n", "
| | age | fnlwgt | education-num | capital-gain | capital-loss | hours-per-week | workclass_Federal-gov | workclass_Local-gov | workclass_Never-worked | workclass_Private | ... | native-country_Portugal | native-country_Puerto-Rico | native-country_Scotland | native-country_South | native-country_Taiwan | native-country_Thailand | native-country_Trinadad&Tobago | native-country_United-States | native-country_Vietnam | native-country_Yugoslavia |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.995706 | 0.350774 | -1.197459 | -0.299271 | -0.221075 | -0.035429 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | -0.042642 | -0.947095 | -0.420060 | -0.299271 | -0.221075 | 0.774468 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | -0.775768 | 1.394362 | 0.746039 | -0.299271 | -0.221075 | -0.035429 | 0.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.397233 | -0.279070 | -0.031360 | 3.345796 | -0.221075 | -0.035429 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 4 | -1.508894 | -0.817458 | -0.031360 | -0.299271 | -0.221075 | -0.845327 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16276 | 0.030671 | 0.242928 | 1.134739 | -0.299271 | -0.221075 | -0.359389 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 16277 | 1.863485 | 1.247055 | -0.420060 | -0.299271 | -0.221075 | -0.035429 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 16278 | -0.042642 | 1.754690 | 1.134739 | -0.299271 | -0.221075 | 0.774468 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 16279 | 0.397233 | -1.003212 | 1.134739 | 3.206033 | -0.221075 | -0.035429 | 0.0 | 0.0 | 0.0 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 16280 | -0.262580 | -0.072293 | 1.134739 | -0.299271 | -0.221075 | 1.584366 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
16281 rows × 100 columns
\n", "