Načíst data
V jakém formátu jsou sloupce?
Kolik je řádků a sloupců?
Obsahují některé sloupce chybějící data?
Udělat popisnou statistiku dat
# Importujeme pandas jako zkrátku
# Není to nutné, ale většina tutorialu to tak má
import pandas as pd
data = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/DAAG/primates.csv')
# Zobrazir první 2 řádky
data.head(n=2)
| Unnamed: 0 | Bodywt | Brainwt | |
|---|---|---|---|
| 0 | Potar monkey | 10.0 | 115 |
| 1 | Gorilla | 207.0 | 406 |
# Zapsát náhodné 3 řádky do CSV souboru
data1 = data.sample(n=3)
data1.to_csv('data1.csv')
# Zjistit datový typ tabulky
type(data)
pandas.core.frame.DataFrame
# Zjistit datový typ sloupce
type(data.Bodywt)
pandas.core.series.Series
primates = pd.read_csv('primates.csv')
primates.head()
| Unnamed: 0 | Bodywt | Brainwt | |
|---|---|---|---|
| 0 | Potar monkey | 10.0 | 115 |
| 1 | Gorilla | 207.0 | 406 |
| 2 | Human | 62.0 | 1320 |
| 3 | Rhesus monkey | 6.8 | 179 |
| 4 | Chimp | 52.2 | 440 |
# Zobrazíme index tabulky
primates.index
RangeIndex(start=0, stop=5, step=1)
# Zobrazíme názvy sloupců tabulky
data.columns
Index(['Unnamed: 0', 'Bodywt', 'Brainwt'], dtype='object')
Načtěte data tips.csv (jsou ve stejné složce) a použijte první sloupec jako index
Bonus: Načtete data z UNICEF pomocí read_excel funkce z pandas a přeskočte prvních 9 řádků
tips = pd.read_csv('tips.csv')
tips.head()
| Unnamed: 0 | total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 3 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
tips = pd.read_csv('tips.csv', index_col=0)
tips.head()
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
# Bonus načtení Excelu
unicef = pd.read_excel('Out-of-School-Rate-Final-for-website.xlsx',
skiprows=9
)
unicef.head()
| ISO3 Code | Countries | Total | Unnamed: 3 | Male | Unnamed: 5 | Female | Unnamed: 7 | Urban | Unnamed: 9 | ... | Poorer | Unnamed: 15 | Middle | Unnamed: 17 | Fourth | Unnamed: 19 | Richest | Unnamed: 21 | Source | Unnamed: 23 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Afghanistan | 36.3 | NaN | 26.9 | NaN | 46.8 | NaN | 19.7 | NaN | ... | 46.2 | NaN | 44.4 | NaN | 31.8 | NaN | 16.7 | NaN | DHS 2015 | NaN |
| 1 | ALB | Albania | 7.60232 | x | 7.76783 | x | 7.41396 | x | 6.66002 | x | ... | 6.45801 | x | 8.8039 | x | 6.7463 | x | 6.39511 | x | DHS 2008-09 | NaN |
| 2 | DZA | Algeria | 2.19084 | NaN | 2.1955 | NaN | 2.18592 | NaN | 1.94815 | NaN | ... | 2.57707 | NaN | 1.7089 | NaN | 1.65181 | NaN | 1.42334 | NaN | MICS 2012-13 | NaN |
| 3 | AND | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | AGO | Angola | 23.7 | y | 22.8 | y | 24.6 | y | 15.1 | y | ... | 25.8 | y | 22.5 | y | 14.4 | y | 9.7 | y | IBEP 2008-09 | NaN |
5 rows × 24 columns
# Zobrazíme všechny sloupce
unicef.columns
Index(['ISO3 Code', 'Countries', 'Total', 'Unnamed: 3', 'Male', 'Unnamed: 5',
'Female', 'Unnamed: 7', 'Urban', 'Unnamed: 9', 'Rural', 'Unnamed: 11',
'Poorest', 'Unnamed: 13', 'Poorer', 'Unnamed: 15', 'Middle',
'Unnamed: 17', 'Fourth', 'Unnamed: 19', 'Richest', 'Unnamed: 21',
'Source', 'Unnamed: 23'],
dtype='object')
# Zahřívací příklad:
# nový seznam obsahuje všechny prvky původního seznamu
my_list = [1,2,3,4,5,6]
[item for item in my_list]
[1, 2, 3, 4, 5, 6]
# Nový seznam obsahuje všechny prvky původního,
# ale každý prvk je vynásoben 100
[item*100 for item in my_list]
[100, 200, 300, 400, 500, 600]
# Nový seznam obsahuje jenom ty prvky seznamu, které jsou sudé
[item for item in my_list if item%2 == 0]
[2, 4, 6]
# Obdobným způsobem vytvoříme seznam názvu sloupců,
# které obsahují 'Unnamed' (částečná shoda)
# To jsou sloupce, které chceme vyhodit
col_to_drop = [
col_name for col_name in unicef.columns if 'Unnamed' in col_name
]
col_to_drop
['Unnamed: 3', 'Unnamed: 5', 'Unnamed: 7', 'Unnamed: 9', 'Unnamed: 11', 'Unnamed: 13', 'Unnamed: 15', 'Unnamed: 17', 'Unnamed: 19', 'Unnamed: 21', 'Unnamed: 23']
# Odstraníme z tabulky seznam sloupců,
# které jsme vytvořili v předešlém kroku
unicef.drop(col_to_drop, axis='columns', inplace=True)
unicef.head()
| ISO3 Code | Countries | Total | Male | Female | Urban | Rural | Poorest | Poorer | Middle | Fourth | Richest | Source | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Afghanistan | 36.3 | 26.9 | 46.8 | 19.7 | 41.4 | 40.6 | 46.2 | 44.4 | 31.8 | 16.7 | DHS 2015 |
| 1 | ALB | Albania | 7.60232 | 7.76783 | 7.41396 | 6.66002 | 8.25574 | 8.8999 | 6.45801 | 8.8039 | 6.7463 | 6.39511 | DHS 2008-09 |
| 2 | DZA | Algeria | 2.19084 | 2.1955 | 2.18592 | 1.94815 | 2.61943 | 3.4101 | 2.57707 | 1.7089 | 1.65181 | 1.42334 | MICS 2012-13 |
| 3 | AND | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | AGO | Angola | 23.7 | 22.8 | 24.6 | 15.1 | 33.2 | 37.2 | 25.8 | 22.5 | 14.4 | 9.7 | IBEP 2008-09 |
primates = pd.read_csv('primates.csv')
# Zobrazíme názvy sloupců
primates.columns
Index(['Unnamed: 0', 'Bodywt', 'Brainwt'], dtype='object')
# inplace=True znamená, že se pozmění současná tabulka
primates.rename(
columns={'Unnamed: 0': 'Species'},
inplace=True
)
primates.head()
| Species | Bodywt | Brainwt | |
|---|---|---|---|
| 0 | Potar monkey | 10.0 | 115 |
| 1 | Gorilla | 207.0 | 406 |
| 2 | Human | 62.0 | 1320 |
| 3 | Rhesus monkey | 6.8 | 179 |
| 4 | Chimp | 52.2 | 440 |
titanic3.xls pomocí read_excel funkce z pandaspclass na passenger_class a home.dest na home_destinationtitanic = pd.read_excel('titanic3.xls')
titanic.head()
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
titanic.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
dtype='object')
titanic.rename(
columns={
'pclass': 'passenger_class',
'home.dest':'home_destination'
},
inplace=True
)
titanic.head()
| passenger_class | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_destination | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
primates.dtypes
Species object Bodywt float64 Brainwt int64 dtype: object
titanic.dtypes
passenger_class int64 survived int64 name object sex object age float64 sibsp int64 parch int64 ticket object fare float64 cabin object embarked object boat object body float64 home_destination object dtype: object
# Počet řádků a počet sloupců
primates.shape
(5, 3)
# Počet řádků
primates.shape[0]
5
# Počet sloupců
primates.shape[1]
3
# Počet řádků
len(primates)
5
titanic3.xls pomocí read_excel funkce z pandastitanic.shape[0]
1309
titanic.shape[1]
14
# Načteme data z Excelu
titanic = pd.read_excel('titanic3.xls')
titanic.head()
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
# Jaké sloupce má tabulka?
titanic.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
dtype='object')
# Má tabulká chybějící data?
titanic.isnull().values.any()
True
# Zjistíme, které sloupce mají chybějící data
titanic.isnull().any()
pclass False survived False name False sex False age True sibsp False parch False ticket False fare True cabin True embarked True boat True body True home.dest True dtype: bool
# Zobrazíme počet chybějících hodnot v každém sloupci
titanic.isnull().sum()
pclass 0 survived 0 name 0 sex 0 age 263 sibsp 0 parch 0 ticket 0 fare 1 cabin 1014 embarked 2 boat 823 body 1188 home.dest 564 dtype: int64
YouthRisk2009.csvBonus: Zjistěte kolík řádků má chybějící data (správná odpověď je 72)
youth_risk = pd.read_csv('YouthRisk2009.csv', index_col=0)
youth_risk.head()
| Sleep | Sleep7 | SmokeLife | SmokeDaily | MarijuaEver | Age | |
|---|---|---|---|---|---|---|
| 1 | NaN | NaN | No | NaN | 0.0 | 16 |
| 2 | 8 hours | 1.0 | Yes | Yes | 1.0 | 16 |
| 3 | 5 hours | 0.0 | Yes | Yes | 1.0 | 17 |
| 4 | 5 hours | 0.0 | Yes | Yes | 1.0 | 18 |
| 5 | 7 hours | 1.0 | Yes | No | 1.0 | 17 |
# Nastavíme maximální počet řádků, které se zobrazí při náhledu
pd.options.display.max_rows = 100
youth_risk.isnull().any(axis=1)
1 True
2 False
3 False
4 False
5 False
6 False
7 False
8 False
9 False
10 False
11 False
12 False
13 False
14 False
15 False
16 False
17 False
18 False
19 False
20 False
21 False
22 False
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 False
31 False
32 True
33 False
34 False
35 False
36 True
37 False
38 False
39 False
40 False
41 False
42 False
43 False
44 False
45 False
46 False
47 False
48 True
49 False
50 False
...
451 False
452 False
453 False
454 False
455 False
456 False
457 False
458 False
459 False
460 False
461 False
462 True
463 False
464 False
465 False
466 False
467 False
468 False
469 False
470 False
471 False
472 False
473 True
474 False
475 False
476 False
477 False
478 False
479 False
480 False
481 False
482 False
483 False
484 False
485 False
486 False
487 False
488 False
489 False
490 False
491 False
492 False
493 False
494 False
495 False
496 False
497 False
498 False
499 False
500 False
Length: 500, dtype: bool
# Počet řádků s chybějícími hodnotami
youth_risk.isnull().any(axis=1).sum()
72
Pandas má 2 způsoby:
iloc vybírá podle pozic (indexů)loc vybírá podle jmen# Načteme data
primates = pd.read_csv('primates.csv')
primates.rename(index=str, columns={'Unnamed: 0': 'Species'}, inplace=True)
primates.head()
| Species | Bodywt | Brainwt | |
|---|---|---|---|
| 0 | Potar monkey | 10.0 | 115 |
| 1 | Gorilla | 207.0 | 406 |
| 2 | Human | 62.0 | 1320 |
| 3 | Rhesus monkey | 6.8 | 179 |
| 4 | Chimp | 52.2 | 440 |
# Vybereme 0. a 1. řádky
primates.iloc[0:2]
| Species | Bodywt | Brainwt | |
|---|---|---|---|
| 0 | Potar monkey | 10.0 | 115 |
| 1 | Gorilla | 207.0 | 406 |
# Vybereme 0. a 1. řádky a 1. a 2. sloupce
primates.iloc[0:2, [0,2]]
| Species | Brainwt | |
|---|---|---|
| 0 | Potar monkey | 115 |
| 1 | Gorilla | 406 |
# Vybíráme podle jména
# POZOR: číselný index musí být převeden řetězec
primates.loc[:'2', ['Species', 'Bodywt']]
| Species | Bodywt | |
|---|---|---|
| 0 | Potar monkey | 10.0 |
| 1 | Gorilla | 207.0 |
| 2 | Human | 62.0 |
# Vybereme sloupec
primates.Bodywt
0 10.0 1 207.0 2 62.0 3 6.8 4 52.2 Name: Bodywt, dtype: float64
# Vybereme sloupec - druhy způsob
primates['Bodywt']
0 10.0 1 207.0 2 62.0 3 6.8 4 52.2 Name: Bodywt, dtype: float64
# Lze specifikovat, který sloupec je index
# Index se v tomto případě skládá z řetězců
primates2 = pd.read_csv(
'primates.csv',
index_col=0
)
primates2.head()
| Bodywt | Brainwt | |
|---|---|---|
| Potar monkey | 10.0 | 115 |
| Gorilla | 207.0 | 406 |
| Human | 62.0 | 1320 |
| Rhesus monkey | 6.8 | 179 |
| Chimp | 52.2 | 440 |
primates2.loc['Potar monkey']
Bodywt 10.0 Brainwt 115.0 Name: Potar monkey, dtype: float64
titanicname)Bonus: Vyberte řádky 10 až 550 a následující sloupce: jméno (name), cena jízdenky (fare), zda osoba přežila (survived)
titanic.iloc[:20]
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 5 | 1 | 1 | Anderson, Mr. Harry | male | 48.0000 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | NaN | New York, NY |
| 6 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0000 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 | NaN | Hudson, NY |
| 7 | 1 | 0 | Andrews, Mr. Thomas Jr | male | 39.0000 | 0 | 0 | 112050 | 0.0000 | A36 | S | NaN | NaN | Belfast, NI |
| 8 | 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.0000 | 2 | 0 | 11769 | 51.4792 | C101 | S | D | NaN | Bayside, Queens, NY |
| 9 | 1 | 0 | Artagaveytia, Mr. Ramon | male | 71.0000 | 0 | 0 | PC 17609 | 49.5042 | NaN | C | NaN | 22.0 | Montevideo, Uruguay |
| 10 | 1 | 0 | Astor, Col. John Jacob | male | 47.0000 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | NaN | 124.0 | New York, NY |
| 11 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.0000 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | 4 | NaN | New York, NY |
| 12 | 1 | 1 | Aubart, Mme. Leontine Pauline | female | 24.0000 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | Paris, France |
| 13 | 1 | 1 | Barber, Miss. Ellen "Nellie" | female | 26.0000 | 0 | 0 | 19877 | 78.8500 | NaN | S | 6 | NaN | NaN |
| 14 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0000 | 0 | 0 | 27042 | 30.0000 | A23 | S | B | NaN | Hessle, Yorks |
| 15 | 1 | 0 | Baumann, Mr. John D | male | NaN | 0 | 0 | PC 17318 | 25.9250 | NaN | S | NaN | NaN | New York, NY |
| 16 | 1 | 0 | Baxter, Mr. Quigg Edmond | male | 24.0000 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | NaN | NaN | Montreal, PQ |
| 17 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0000 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | 6 | NaN | Montreal, PQ |
| 18 | 1 | 1 | Bazzani, Miss. Albina | female | 32.0000 | 0 | 0 | 11813 | 76.2917 | D15 | C | 8 | NaN | NaN |
| 19 | 1 | 0 | Beattie, Mr. Thomson | male | 36.0000 | 0 | 0 | 13050 | 75.2417 | C6 | C | A | NaN | Winnipeg, MN |
titanic.iloc[:20, 2]
0 Allen, Miss. Elisabeth Walton 1 Allison, Master. Hudson Trevor 2 Allison, Miss. Helen Loraine 3 Allison, Mr. Hudson Joshua Creighton 4 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) 5 Anderson, Mr. Harry 6 Andrews, Miss. Kornelia Theodosia 7 Andrews, Mr. Thomas Jr 8 Appleton, Mrs. Edward Dale (Charlotte Lamson) 9 Artagaveytia, Mr. Ramon 10 Astor, Col. John Jacob 11 Astor, Mrs. John Jacob (Madeleine Talmadge Force) 12 Aubart, Mme. Leontine Pauline 13 Barber, Miss. Ellen "Nellie" 14 Barkworth, Mr. Algernon Henry Wilson 15 Baumann, Mr. John D 16 Baxter, Mr. Quigg Edmond 17 Baxter, Mrs. James (Helene DeLaudeniere Chaput) 18 Bazzani, Miss. Albina 19 Beattie, Mr. Thomson Name: name, dtype: object
titanic.iloc[10:551, [1,2,8]]
| survived | name | fare | |
|---|---|---|---|
| 10 | 0 | Astor, Col. John Jacob | 227.5250 |
| 11 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | 227.5250 |
| 12 | 1 | Aubart, Mme. Leontine Pauline | 69.3000 |
| 13 | 1 | Barber, Miss. Ellen "Nellie" | 78.8500 |
| 14 | 1 | Barkworth, Mr. Algernon Henry Wilson | 30.0000 |
| 15 | 0 | Baumann, Mr. John D | 25.9250 |
| 16 | 0 | Baxter, Mr. Quigg Edmond | 247.5208 |
| 17 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | 247.5208 |
| 18 | 1 | Bazzani, Miss. Albina | 76.2917 |
| 19 | 0 | Beattie, Mr. Thomson | 75.2417 |
| 20 | 1 | Beckwith, Mr. Richard Leonard | 52.5542 |
| 21 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 52.5542 |
| 22 | 1 | Behr, Mr. Karl Howell | 30.0000 |
| 23 | 1 | Bidois, Miss. Rosalie | 227.5250 |
| 24 | 1 | Bird, Miss. Ellen | 221.7792 |
| 25 | 0 | Birnbaum, Mr. Jakob | 26.0000 |
| 26 | 1 | Bishop, Mr. Dickinson H | 91.0792 |
| 27 | 1 | Bishop, Mrs. Dickinson H (Helen Walton) | 91.0792 |
| 28 | 1 | Bissette, Miss. Amelia | 135.6333 |
| 29 | 1 | Bjornstrom-Steffansson, Mr. Mauritz Hakan | 26.5500 |
| 30 | 0 | Blackwell, Mr. Stephen Weart | 35.5000 |
| 31 | 1 | Blank, Mr. Henry | 31.0000 |
| 32 | 1 | Bonnell, Miss. Caroline | 164.8667 |
| 33 | 1 | Bonnell, Miss. Elizabeth | 26.5500 |
| 34 | 0 | Borebank, Mr. John James | 26.5500 |
| 35 | 1 | Bowen, Miss. Grace Scott | 262.3750 |
| 36 | 1 | Bowerman, Miss. Elsie Edith | 55.0000 |
| 37 | 1 | Bradley, Mr. George ("George Arthur Brayton") | 26.5500 |
| 38 | 0 | Brady, Mr. John Bertram | 30.5000 |
| 39 | 0 | Brandeis, Mr. Emil | 50.4958 |
| 40 | 0 | Brewe, Dr. Arthur Jackson | 39.6000 |
| 41 | 1 | Brown, Mrs. James Joseph (Margaret Tobin) | 27.7208 |
| 42 | 1 | Brown, Mrs. John Murray (Caroline Lane Lamson) | 51.4792 |
| 43 | 1 | Bucknell, Mrs. William Robert (Emma Eliza Ward) | 76.2917 |
| 44 | 1 | Burns, Miss. Elizabeth Margaret | 134.5000 |
| 45 | 0 | Butt, Major. Archibald Willingham | 26.5500 |
| 46 | 0 | Cairns, Mr. Alexander | 31.0000 |
| 47 | 1 | Calderhead, Mr. Edward Pennington | 26.2875 |
| 48 | 1 | Candee, Mrs. Edward (Helen Churchill Hungerford) | 27.4458 |
| 49 | 1 | Cardeza, Mr. Thomas Drake Martinez | 512.3292 |
| 50 | 1 | Cardeza, Mrs. James Warburton Martinez (Charlo... | 512.3292 |
| 51 | 0 | Carlsson, Mr. Frans Olof | 5.0000 |
| 52 | 0 | Carrau, Mr. Francisco M | 47.1000 |
| 53 | 0 | Carrau, Mr. Jose Pedro | 47.1000 |
| 54 | 1 | Carter, Master. William Thornton II | 120.0000 |
| 55 | 1 | Carter, Miss. Lucile Polk | 120.0000 |
| 56 | 1 | Carter, Mr. William Ernest | 120.0000 |
| 57 | 1 | Carter, Mrs. William Ernest (Lucile Polk) | 120.0000 |
| 58 | 0 | Case, Mr. Howard Brown | 26.0000 |
| 59 | 1 | Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... | 27.7208 |
| ... | ... | ... | ... |
| 501 | 1 | Mellinger, Miss. Madeleine Violet | 19.5000 |
| 502 | 1 | Mellinger, Mrs. (Elizabeth Anne Maidment) | 19.5000 |
| 503 | 1 | Mellors, Mr. William John | 10.5000 |
| 504 | 0 | Meyer, Mr. August | 13.0000 |
| 505 | 0 | Milling, Mr. Jacob Christian | 13.0000 |
| 506 | 0 | Mitchell, Mr. Henry Michael | 10.5000 |
| 507 | 0 | Montvila, Rev. Juozas | 13.0000 |
| 508 | 0 | Moraweck, Dr. Ernest | 14.0000 |
| 509 | 0 | Morley, Mr. Henry Samuel ("Mr Henry Marshall") | 26.0000 |
| 510 | 0 | Mudd, Mr. Thomas Charles | 10.5000 |
| 511 | 0 | Myles, Mr. Thomas Francis | 9.6875 |
| 512 | 0 | Nasser, Mr. Nicholas | 30.0708 |
| 513 | 1 | Nasser, Mrs. Nicholas (Adele Achem) | 30.0708 |
| 514 | 1 | Navratil, Master. Edmond Roger | 26.0000 |
| 515 | 1 | Navratil, Master. Michel M | 26.0000 |
| 516 | 0 | Navratil, Mr. Michel ("Louis M Hoffman") | 26.0000 |
| 517 | 0 | Nesson, Mr. Israel | 13.0000 |
| 518 | 0 | Nicholls, Mr. Joseph Charles | 36.7500 |
| 519 | 0 | Norman, Mr. Robert Douglas | 13.5000 |
| 520 | 1 | Nourney, Mr. Alfred ("Baron von Drachstedt") | 13.8625 |
| 521 | 1 | Nye, Mrs. (Elizabeth Ramell) | 10.5000 |
| 522 | 0 | Otter, Mr. Richard | 13.0000 |
| 523 | 1 | Oxenham, Mr. Percy Thomas | 10.5000 |
| 524 | 1 | Padro y Manent, Mr. Julian | 13.8625 |
| 525 | 0 | Pain, Dr. Alfred | 10.5000 |
| 526 | 1 | Pallas y Castello, Mr. Emilio | 13.8583 |
| 527 | 0 | Parker, Mr. Clifford Richard | 10.5000 |
| 528 | 0 | Parkes, Mr. Francis "Frank" | 0.0000 |
| 529 | 1 | Parrish, Mrs. (Lutie Davis) | 26.0000 |
| 530 | 0 | Pengelly, Mr. Frederick William | 10.5000 |
| 531 | 0 | Pernot, Mr. Rene | 15.0500 |
| 532 | 0 | Peruschitz, Rev. Joseph Maria | 13.0000 |
| 533 | 1 | Phillips, Miss. Alice Frances Louisa | 21.0000 |
| 534 | 1 | Phillips, Miss. Kate Florence ("Mrs Kate Louis... | 26.0000 |
| 535 | 0 | Phillips, Mr. Escott Robert | 21.0000 |
| 536 | 1 | Pinsky, Mrs. (Rosa) | 13.0000 |
| 537 | 0 | Ponesell, Mr. Martin | 13.0000 |
| 538 | 1 | Portaluppi, Mr. Emilio Ilario Giuseppe | 12.7375 |
| 539 | 0 | Pulbaum, Mr. Franz | 15.0333 |
| 540 | 1 | Quick, Miss. Phyllis May | 26.0000 |
| 541 | 1 | Quick, Miss. Winifred Vera | 26.0000 |
| 542 | 1 | Quick, Mrs. Frederick Charles (Jane Richards) | 26.0000 |
| 543 | 0 | Reeves, Mr. David | 10.5000 |
| 544 | 0 | Renouf, Mr. Peter Henry | 21.0000 |
| 545 | 1 | Renouf, Mrs. Peter Henry (Lillian Jefferys) | 21.0000 |
| 546 | 1 | Reynaldo, Ms. Encarnacion | 13.0000 |
| 547 | 0 | Richard, Mr. Emile | 15.0458 |
| 548 | 1 | Richards, Master. George Sibley | 18.7500 |
| 549 | 1 | Richards, Master. William Rowe | 18.7500 |
| 550 | 1 | Richards, Mrs. Sidney (Emily Hocking) | 18.7500 |
541 rows × 3 columns
titanic.loc[10:550, ['name', 'fare', 'survived']]
| name | fare | survived | |
|---|---|---|---|
| 10 | Astor, Col. John Jacob | 227.5250 | 0 |
| 11 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | 227.5250 | 1 |
| 12 | Aubart, Mme. Leontine Pauline | 69.3000 | 1 |
| 13 | Barber, Miss. Ellen "Nellie" | 78.8500 | 1 |
| 14 | Barkworth, Mr. Algernon Henry Wilson | 30.0000 | 1 |
| 15 | Baumann, Mr. John D | 25.9250 | 0 |
| 16 | Baxter, Mr. Quigg Edmond | 247.5208 | 0 |
| 17 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | 247.5208 | 1 |
| 18 | Bazzani, Miss. Albina | 76.2917 | 1 |
| 19 | Beattie, Mr. Thomson | 75.2417 | 0 |
| 20 | Beckwith, Mr. Richard Leonard | 52.5542 | 1 |
| 21 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 52.5542 | 1 |
| 22 | Behr, Mr. Karl Howell | 30.0000 | 1 |
| 23 | Bidois, Miss. Rosalie | 227.5250 | 1 |
| 24 | Bird, Miss. Ellen | 221.7792 | 1 |
| 25 | Birnbaum, Mr. Jakob | 26.0000 | 0 |
| 26 | Bishop, Mr. Dickinson H | 91.0792 | 1 |
| 27 | Bishop, Mrs. Dickinson H (Helen Walton) | 91.0792 | 1 |
| 28 | Bissette, Miss. Amelia | 135.6333 | 1 |
| 29 | Bjornstrom-Steffansson, Mr. Mauritz Hakan | 26.5500 | 1 |
| 30 | Blackwell, Mr. Stephen Weart | 35.5000 | 0 |
| 31 | Blank, Mr. Henry | 31.0000 | 1 |
| 32 | Bonnell, Miss. Caroline | 164.8667 | 1 |
| 33 | Bonnell, Miss. Elizabeth | 26.5500 | 1 |
| 34 | Borebank, Mr. John James | 26.5500 | 0 |
| 35 | Bowen, Miss. Grace Scott | 262.3750 | 1 |
| 36 | Bowerman, Miss. Elsie Edith | 55.0000 | 1 |
| 37 | Bradley, Mr. George ("George Arthur Brayton") | 26.5500 | 1 |
| 38 | Brady, Mr. John Bertram | 30.5000 | 0 |
| 39 | Brandeis, Mr. Emil | 50.4958 | 0 |
| 40 | Brewe, Dr. Arthur Jackson | 39.6000 | 0 |
| 41 | Brown, Mrs. James Joseph (Margaret Tobin) | 27.7208 | 1 |
| 42 | Brown, Mrs. John Murray (Caroline Lane Lamson) | 51.4792 | 1 |
| 43 | Bucknell, Mrs. William Robert (Emma Eliza Ward) | 76.2917 | 1 |
| 44 | Burns, Miss. Elizabeth Margaret | 134.5000 | 1 |
| 45 | Butt, Major. Archibald Willingham | 26.5500 | 0 |
| 46 | Cairns, Mr. Alexander | 31.0000 | 0 |
| 47 | Calderhead, Mr. Edward Pennington | 26.2875 | 1 |
| 48 | Candee, Mrs. Edward (Helen Churchill Hungerford) | 27.4458 | 1 |
| 49 | Cardeza, Mr. Thomas Drake Martinez | 512.3292 | 1 |
| 50 | Cardeza, Mrs. James Warburton Martinez (Charlo... | 512.3292 | 1 |
| 51 | Carlsson, Mr. Frans Olof | 5.0000 | 0 |
| 52 | Carrau, Mr. Francisco M | 47.1000 | 0 |
| 53 | Carrau, Mr. Jose Pedro | 47.1000 | 0 |
| 54 | Carter, Master. William Thornton II | 120.0000 | 1 |
| 55 | Carter, Miss. Lucile Polk | 120.0000 | 1 |
| 56 | Carter, Mr. William Ernest | 120.0000 | 1 |
| 57 | Carter, Mrs. William Ernest (Lucile Polk) | 120.0000 | 1 |
| 58 | Case, Mr. Howard Brown | 26.0000 | 0 |
| 59 | Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... | 27.7208 | 1 |
| ... | ... | ... | ... |
| 501 | Mellinger, Miss. Madeleine Violet | 19.5000 | 1 |
| 502 | Mellinger, Mrs. (Elizabeth Anne Maidment) | 19.5000 | 1 |
| 503 | Mellors, Mr. William John | 10.5000 | 1 |
| 504 | Meyer, Mr. August | 13.0000 | 0 |
| 505 | Milling, Mr. Jacob Christian | 13.0000 | 0 |
| 506 | Mitchell, Mr. Henry Michael | 10.5000 | 0 |
| 507 | Montvila, Rev. Juozas | 13.0000 | 0 |
| 508 | Moraweck, Dr. Ernest | 14.0000 | 0 |
| 509 | Morley, Mr. Henry Samuel ("Mr Henry Marshall") | 26.0000 | 0 |
| 510 | Mudd, Mr. Thomas Charles | 10.5000 | 0 |
| 511 | Myles, Mr. Thomas Francis | 9.6875 | 0 |
| 512 | Nasser, Mr. Nicholas | 30.0708 | 0 |
| 513 | Nasser, Mrs. Nicholas (Adele Achem) | 30.0708 | 1 |
| 514 | Navratil, Master. Edmond Roger | 26.0000 | 1 |
| 515 | Navratil, Master. Michel M | 26.0000 | 1 |
| 516 | Navratil, Mr. Michel ("Louis M Hoffman") | 26.0000 | 0 |
| 517 | Nesson, Mr. Israel | 13.0000 | 0 |
| 518 | Nicholls, Mr. Joseph Charles | 36.7500 | 0 |
| 519 | Norman, Mr. Robert Douglas | 13.5000 | 0 |
| 520 | Nourney, Mr. Alfred ("Baron von Drachstedt") | 13.8625 | 1 |
| 521 | Nye, Mrs. (Elizabeth Ramell) | 10.5000 | 1 |
| 522 | Otter, Mr. Richard | 13.0000 | 0 |
| 523 | Oxenham, Mr. Percy Thomas | 10.5000 | 1 |
| 524 | Padro y Manent, Mr. Julian | 13.8625 | 1 |
| 525 | Pain, Dr. Alfred | 10.5000 | 0 |
| 526 | Pallas y Castello, Mr. Emilio | 13.8583 | 1 |
| 527 | Parker, Mr. Clifford Richard | 10.5000 | 0 |
| 528 | Parkes, Mr. Francis "Frank" | 0.0000 | 0 |
| 529 | Parrish, Mrs. (Lutie Davis) | 26.0000 | 1 |
| 530 | Pengelly, Mr. Frederick William | 10.5000 | 0 |
| 531 | Pernot, Mr. Rene | 15.0500 | 0 |
| 532 | Peruschitz, Rev. Joseph Maria | 13.0000 | 0 |
| 533 | Phillips, Miss. Alice Frances Louisa | 21.0000 | 1 |
| 534 | Phillips, Miss. Kate Florence ("Mrs Kate Louis... | 26.0000 | 1 |
| 535 | Phillips, Mr. Escott Robert | 21.0000 | 0 |
| 536 | Pinsky, Mrs. (Rosa) | 13.0000 | 1 |
| 537 | Ponesell, Mr. Martin | 13.0000 | 0 |
| 538 | Portaluppi, Mr. Emilio Ilario Giuseppe | 12.7375 | 1 |
| 539 | Pulbaum, Mr. Franz | 15.0333 | 0 |
| 540 | Quick, Miss. Phyllis May | 26.0000 | 1 |
| 541 | Quick, Miss. Winifred Vera | 26.0000 | 1 |
| 542 | Quick, Mrs. Frederick Charles (Jane Richards) | 26.0000 | 1 |
| 543 | Reeves, Mr. David | 10.5000 | 0 |
| 544 | Renouf, Mr. Peter Henry | 21.0000 | 0 |
| 545 | Renouf, Mrs. Peter Henry (Lillian Jefferys) | 21.0000 | 1 |
| 546 | Reynaldo, Ms. Encarnacion | 13.0000 | 1 |
| 547 | Richard, Mr. Emile | 15.0458 | 0 |
| 548 | Richards, Master. George Sibley | 18.7500 | 1 |
| 549 | Richards, Master. William Rowe | 18.7500 | 1 |
| 550 | Richards, Mrs. Sidney (Emily Hocking) | 18.7500 | 1 |
541 rows × 3 columns
Primáti:
primates
| Species | Bodywt | Brainwt | |
|---|---|---|---|
| 0 | Potar monkey | 10.0 | 115 |
| 1 | Gorilla | 207.0 | 406 |
| 2 | Human | 62.0 | 1320 |
| 3 | Rhesus monkey | 6.8 | 179 |
| 4 | Chimp | 52.2 | 440 |
brainwt_kg = primates.Brainwt / 1000
brainwt_kg
0 0.115 1 0.406 2 1.320 3 0.179 4 0.440 Name: Brainwt, dtype: float64
brainwt_kg / primates.Bodywt
0 0.011500 1 0.001961 2 0.021290 3 0.026324 4 0.008429 dtype: float64
# První způsob vytvoření nového sloupce
primates['ratio'] = brainwt_kg / primates.Bodywt
primates
| Species | Bodywt | Brainwt | ratio | |
|---|---|---|---|---|
| 0 | Potar monkey | 10.0 | 115 | 0.011500 |
| 1 | Gorilla | 207.0 | 406 | 0.001961 |
| 2 | Human | 62.0 | 1320 | 0.021290 |
| 3 | Rhesus monkey | 6.8 | 179 | 0.026324 |
| 4 | Chimp | 52.2 | 440 | 0.008429 |
# První způsob vytvoření nového sloupce
primates.assign(ratio2=(brainwt_kg / primates.Bodywt), inplace=True)
primates
| Species | Bodywt | Brainwt | ratio | |
|---|---|---|---|---|
| 0 | Potar monkey | 10.0 | 115 | 0.011500 |
| 1 | Gorilla | 207.0 | 406 | 0.001961 |
| 2 | Human | 62.0 | 1320 | 0.021290 |
| 3 | Rhesus monkey | 6.8 | 179 | 0.026324 |
| 4 | Chimp | 52.2 | 440 | 0.008429 |
# Nevektorové oprerace - iterujeme řádky tabulky
for row_index, row in primates.iterrows():
print(row.Bodywt)
10.0 207.0 62.0 6.8 52.2
primates2
| Bodywt | Brainwt | |
|---|---|---|
| Potar monkey | 10.0 | 115 |
| Gorilla | 207.0 | 406 |
| Human | 62.0 | 1320 |
| Rhesus monkey | 6.8 | 179 |
| Chimp | 52.2 | 440 |
primates2.columns
Index(['Bodywt', 'Brainwt'], dtype='object')
primates2.Bodywt
Potar monkey 10.0 Gorilla 207.0 Human 62.0 Rhesus monkey 6.8 Chimp 52.2 Name: Bodywt, dtype: float64
primates2.Bodywt > 60
Potar monkey False Gorilla True Human True Rhesus monkey False Chimp False Name: Bodywt, dtype: bool
# ~ znamená negace
~(primates2.Brainwt < 500)
Potar monkey False Gorilla False Human True Rhesus monkey False Chimp False Name: Brainwt, dtype: bool
# | znamená nebo
primates2[(primates2.Bodywt > 60) | (primates2.Brainwt >= 500)]
| Bodywt | Brainwt | |
|---|---|---|
| Gorilla | 207.0 | 406 |
| Human | 62.0 | 1320 |
# Vyberéme řádky, kde tělesná hmota je více než 60
primates2[primates2.Bodywt > 60]
| Bodywt | Brainwt | |
|---|---|---|
| Gorilla | 207.0 | 406 |
| Human | 62.0 | 1320 |
# Vyberéme řádky, kde je tělesná hmota více než 60 a
# hmota mozku je méně než 500
primates2[(primates2.Bodywt > 60) & (primates2.Brainwt < 500)]
| Bodywt | Brainwt | |
|---|---|---|
| Gorilla | 207.0 | 406 |
titanicBonus: Pro vybrané řádky spočítejte průměrnou cenu jízdenky (fare) (správná odpověď je cca 51)
sum(titanic.sex == 'female')
466
len(titanic[titanic.sex == 'female'])
466
titanic[titanic.sex == 'female'].shape[0]
466
sum((titanic.age > 1) & (titanic.sex == 'female'))
379
titanic[(titanic.age > 1) & (titanic.sex == 'female')].fare.mean()
51.017481530343005
Použijeme metodu groupby, která seskupí řádky dle jednoho nebo více sloupců.
Potom použijeme nějakou metodu nad seskupenými řádky (count, mean, sum).
# Načteme data z Excelu
data = pd.read_excel('titanic3.xls')
data.head()
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
| 1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
| 2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
| 3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
| 4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
# Groupby vrací iterátor
groups = titanic.groupby('sex')
# Unikátní hodnoty dle kterých je rozdělena tabulka
groups.groups.keys()
dict_keys(['female', 'male'])
# Seskupíme řádky dle pohlaví a spočítáme průměrnou hodnotu všech číselných proměnných
data.groupby('sex').mean()
| pclass | survived | age | sibsp | parch | fare | body | |
|---|---|---|---|---|---|---|---|
| sex | |||||||
| female | 2.154506 | 0.727468 | 28.687071 | 0.652361 | 0.633047 | 46.198097 | 166.62500 |
| male | 2.372479 | 0.190985 | 30.585233 | 0.413998 | 0.247924 | 26.154601 | 160.39823 |
# Totéž můžeme udělat pomocí filtrování
titanic[titanic.sex == 'female'].mean()
pclass 2.154506 survived 0.727468 age 28.687071 sibsp 0.652361 parch 0.633047 fare 46.198097 body 166.625000 dtype: float64
titanic[titanic.sex == 'male'].mean()
pclass 2.372479 survived 0.190985 age 30.585233 sibsp 0.413998 parch 0.247924 fare 26.154601 body 160.398230 dtype: float64
# Ve smyčce vytiskneme průměrné hodnoty pro cestující obou pohlaví
for sex in titanic.sex.unique():
print(sex, titanic[titanic.sex == sex].mean())
female pclass 2.154506 survived 0.727468 age 28.687071 sibsp 0.652361 parch 0.633047 fare 46.198097 body 166.625000 dtype: float64 male pclass 2.372479 survived 0.190985 age 30.585233 sibsp 0.413998 parch 0.247924 fare 26.154601 body 160.398230 dtype: float64
# Seskupení dle více proměnných - pohlaví a místa nalodění
titanic.groupby(['sex', 'embarked', 'survived']).mean()
| pclass | age | sibsp | parch | fare | body | |||
|---|---|---|---|---|---|---|---|---|
| sex | embarked | survived | ||||||
| female | C | 0 | 2.636364 | 27.071429 | 0.454545 | 0.454545 | 17.460991 | 328.000000 |
| 1 | 1.539216 | 31.543956 | 0.509804 | 0.519608 | 87.994648 | NaN | ||
| Q | 0 | 3.000000 | 26.692308 | 0.086957 | 0.347826 | 9.311961 | 229.000000 | |
| 1 | 2.837838 | 24.000000 | 0.270270 | 0.000000 | 14.564192 | NaN | ||
| S | 0 | 2.806452 | 24.842105 | 1.333333 | 1.075269 | 26.117341 | 79.500000 | |
| 1 | 1.924242 | 29.104167 | 0.560606 | 0.651515 | 45.549622 | NaN | ||
| male | C | 0 | 2.119266 | 35.178082 | 0.302752 | 0.201835 | 42.555965 | 155.083333 |
| 1 | 1.729167 | 29.912602 | 0.375000 | 0.416667 | 63.013896 | NaN | ||
| Q | 0 | 2.875000 | 32.104167 | 0.500000 | 0.107143 | 12.561384 | 134.750000 | |
| 1 | 3.000000 | 25.000000 | 0.285714 | 0.000000 | 9.974400 | NaN | ||
| S | 0 | 2.460348 | 30.855829 | 0.444874 | 0.241779 | 20.722284 | 163.105882 | |
| 1 | 2.188679 | 25.712862 | 0.358491 | 0.339623 | 27.291978 | NaN |
Seskupte titanic dle třídy (pclass) a pohlaví (sex) zobrazte medián (median)
Bonus: zobrazte jenom sloupec věk (age)
titanic.groupby(['pclass', 'sex']).median()
| survived | age | sibsp | parch | fare | body | ||
|---|---|---|---|---|---|---|---|
| pclass | sex | ||||||
| 1 | female | 1 | 36.0 | 0 | 0 | 80.92915 | NaN |
| male | 0 | 42.0 | 0 | 0 | 49.50420 | 166.0 | |
| 2 | female | 1 | 28.0 | 0 | 0 | 23.00000 | 52.0 |
| male | 0 | 29.5 | 0 | 0 | 13.00000 | 160.0 | |
| 3 | female | 0 | 22.0 | 0 | 0 | 10.48960 | 206.0 |
| male | 0 | 25.0 | 0 | 0 | 7.89580 | 148.0 |
# Pokud seskupojeme podle 2 a více proměnných, výsledná tabulka
# bude mít několikaúrovňový index
titanic.groupby(['pclass', 'sex']).median()['age'].index
MultiIndex(levels=[[1, 2, 3], ['female', 'male']],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=['pclass', 'sex'])
titanic.groupby(['pclass',]).median()['age'].index
Int64Index([1, 2, 3], dtype='int64', name='pclass')
titanic.groupby(['pclass', 'sex']).median().age
pclass sex
1 female 36.0
male 42.0
2 female 28.0
male 29.5
3 female 22.0
male 25.0
Name: age, dtype: float64
# Bez include='all' jenom numerické sloupce
titanic.describe(include='all')
| pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1309.000000 | 1309.000000 | 1309 | 1309 | 1046.000000 | 1309.000000 | 1309.000000 | 1309 | 1308.000000 | 295 | 1307 | 486 | 121.000000 | 745 |
| unique | NaN | NaN | 1307 | 2 | NaN | NaN | NaN | 939 | NaN | 186 | 3 | 28 | NaN | 369 |
| top | NaN | NaN | Connolly, Miss. Kate | male | NaN | NaN | NaN | CA. 2343 | NaN | C23 C25 C27 | S | 13 | NaN | New York, NY |
| freq | NaN | NaN | 2 | 843 | NaN | NaN | NaN | 11 | NaN | 6 | 914 | 39 | NaN | 64 |
| mean | 2.294882 | 0.381971 | NaN | NaN | 29.881135 | 0.498854 | 0.385027 | NaN | 33.295479 | NaN | NaN | NaN | 160.809917 | NaN |
| std | 0.837836 | 0.486055 | NaN | NaN | 14.413500 | 1.041658 | 0.865560 | NaN | 51.758668 | NaN | NaN | NaN | 97.696922 | NaN |
| min | 1.000000 | 0.000000 | NaN | NaN | 0.166700 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN | NaN | 1.000000 | NaN |
| 25% | 2.000000 | 0.000000 | NaN | NaN | 21.000000 | 0.000000 | 0.000000 | NaN | 7.895800 | NaN | NaN | NaN | 72.000000 | NaN |
| 50% | 3.000000 | 0.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN | NaN | 155.000000 | NaN |
| 75% | 3.000000 | 1.000000 | NaN | NaN | 39.000000 | 1.000000 | 0.000000 | NaN | 31.275000 | NaN | NaN | NaN | 256.000000 | NaN |
| max | 3.000000 | 1.000000 | NaN | NaN | 80.000000 | 8.000000 | 9.000000 | NaN | 512.329200 | NaN | NaN | NaN | 328.000000 | NaN |
# Bez include='all' jenom numerické sloupce
titanic.describe()
| pclass | survived | age | sibsp | parch | fare | body | |
|---|---|---|---|---|---|---|---|
| count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 121.000000 |
| mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
| std | 0.837836 | 0.486055 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 97.696922 |
| min | 1.000000 | 0.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| 25% | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 72.000000 |
| 50% | 3.000000 | 0.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 155.000000 |
| 75% | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 256.000000 |
| max | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 328.000000 |
Použijte následující postup a seznamte se s těmito daty, popis dat
Načíst data
V jakém formátu jsou sloupce?
Kolik je řádků a sloupců?
Obsahují některé sloupce chybějící data?
Udělat popisnou statistiku dat
education = pd.read_csv(
'http://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv',
index_col=0
)
education.head()
| State | Region | X1 | X2 | X3 | Y | |
|---|---|---|---|---|---|---|
| 1 | ME | 1 | 508 | 3944 | 325 | 235 |
| 2 | NH | 1 | 564 | 4578 | 323 | 231 |
| 3 | VT | 1 | 322 | 4011 | 328 | 270 |
| 4 | MA | 1 | 846 | 5233 | 305 | 261 |
| 5 | RI | 1 | 871 | 4780 | 303 | 300 |
education.rename(columns={
'X1': 'residents', 'X2': 'income',
'X3': 'residents_young', 'Y': 'expenditure'
},
inplace=True
)
education.head()
| State | Region | residents | income | residents_young | expenditure | |
|---|---|---|---|---|---|---|
| 1 | ME | 1 | 508 | 3944 | 325 | 235 |
| 2 | NH | 1 | 564 | 4578 | 323 | 231 |
| 3 | VT | 1 | 322 | 4011 | 328 | 270 |
| 4 | MA | 1 | 846 | 5233 | 305 | 261 |
| 5 | RI | 1 | 871 | 4780 | 303 | 300 |
education.dtypes
State object Region int64 residents int64 income int64 residents_young int64 expenditure int64 dtype: object
education.shape
(50, 6)
education.isnull().any()
State False Region False residents False income False residents_young False expenditure False dtype: bool
education.describe(include='all')
| State | Region | residents | income | residents_young | expenditure | |
|---|---|---|---|---|---|---|
| count | 50 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
| unique | 50 | NaN | NaN | NaN | NaN | NaN |
| top | CO | NaN | NaN | NaN | NaN | NaN |
| freq | 1 | NaN | NaN | NaN | NaN | NaN |
| mean | NaN | 2.660000 | 657.800000 | 4675.120000 | 325.740000 | 284.600000 |
| std | NaN | 1.061574 | 145.016396 | 644.506254 | 19.423119 | 61.340136 |
| min | NaN | 1.000000 | 322.000000 | 3448.000000 | 287.000000 | 208.000000 |
| 25% | NaN | 2.000000 | 546.750000 | 4137.250000 | 310.750000 | 234.250000 |
| 50% | NaN | 3.000000 | 662.500000 | 4706.000000 | 324.500000 | 269.500000 |
| 75% | NaN | 3.750000 | 782.250000 | 5054.250000 | 333.000000 | 316.750000 |
| max | NaN | 4.000000 | 909.000000 | 5889.000000 | 386.000000 | 546.000000 |