Načíst data
V jakém formátu jsou sloupce?
Kolik je řádků a sloupců?
Obsahují některé sloupce chybějící data?
Udělat popisnou statistiku dat
# Importujeme pandas jako zkrátku
# Není to nutné, ale většina tutorialu to tak má
import pandas as pd
data = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/DAAG/primates.csv')
# Zobrazir první 2 řádky
data.head(n=2)
Unnamed: 0 | Bodywt | Brainwt | |
---|---|---|---|
0 | Potar monkey | 10.0 | 115 |
1 | Gorilla | 207.0 | 406 |
# Zapsát náhodné 3 řádky do CSV souboru
data1 = data.sample(n=3)
data1.to_csv('data1.csv')
# Zjistit datový typ tabulky
type(data)
pandas.core.frame.DataFrame
# Zjistit datový typ sloupce
type(data.Bodywt)
pandas.core.series.Series
primates = pd.read_csv('primates.csv')
primates.head()
Unnamed: 0 | Bodywt | Brainwt | |
---|---|---|---|
0 | Potar monkey | 10.0 | 115 |
1 | Gorilla | 207.0 | 406 |
2 | Human | 62.0 | 1320 |
3 | Rhesus monkey | 6.8 | 179 |
4 | Chimp | 52.2 | 440 |
# Zobrazíme index tabulky
primates.index
RangeIndex(start=0, stop=5, step=1)
# Zobrazíme názvy sloupců tabulky
data.columns
Index(['Unnamed: 0', 'Bodywt', 'Brainwt'], dtype='object')
Načtěte data tips.csv
(jsou ve stejné složce) a použijte první sloupec jako index
Bonus: Načtete data z UNICEF pomocí read_excel
funkce z pandas a přeskočte prvních 9 řádků
tips = pd.read_csv('tips.csv')
tips.head()
Unnamed: 0 | total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 3 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
tips = pd.read_csv('tips.csv', index_col=0)
tips.head()
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
3 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
# Bonus načtení Excelu
unicef = pd.read_excel('Out-of-School-Rate-Final-for-website.xlsx',
skiprows=9
)
unicef.head()
ISO3 Code | Countries | Total | Unnamed: 3 | Male | Unnamed: 5 | Female | Unnamed: 7 | Urban | Unnamed: 9 | ... | Poorer | Unnamed: 15 | Middle | Unnamed: 17 | Fourth | Unnamed: 19 | Richest | Unnamed: 21 | Source | Unnamed: 23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | Afghanistan | 36.3 | NaN | 26.9 | NaN | 46.8 | NaN | 19.7 | NaN | ... | 46.2 | NaN | 44.4 | NaN | 31.8 | NaN | 16.7 | NaN | DHS 2015 | NaN |
1 | ALB | Albania | 7.60232 | x | 7.76783 | x | 7.41396 | x | 6.66002 | x | ... | 6.45801 | x | 8.8039 | x | 6.7463 | x | 6.39511 | x | DHS 2008-09 | NaN |
2 | DZA | Algeria | 2.19084 | NaN | 2.1955 | NaN | 2.18592 | NaN | 1.94815 | NaN | ... | 2.57707 | NaN | 1.7089 | NaN | 1.65181 | NaN | 1.42334 | NaN | MICS 2012-13 | NaN |
3 | AND | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | AGO | Angola | 23.7 | y | 22.8 | y | 24.6 | y | 15.1 | y | ... | 25.8 | y | 22.5 | y | 14.4 | y | 9.7 | y | IBEP 2008-09 | NaN |
5 rows × 24 columns
# Zobrazíme všechny sloupce
unicef.columns
Index(['ISO3 Code', 'Countries', 'Total', 'Unnamed: 3', 'Male', 'Unnamed: 5', 'Female', 'Unnamed: 7', 'Urban', 'Unnamed: 9', 'Rural', 'Unnamed: 11', 'Poorest', 'Unnamed: 13', 'Poorer', 'Unnamed: 15', 'Middle', 'Unnamed: 17', 'Fourth', 'Unnamed: 19', 'Richest', 'Unnamed: 21', 'Source', 'Unnamed: 23'], dtype='object')
# Zahřívací příklad:
# nový seznam obsahuje všechny prvky původního seznamu
my_list = [1,2,3,4,5,6]
[item for item in my_list]
[1, 2, 3, 4, 5, 6]
# Nový seznam obsahuje všechny prvky původního,
# ale každý prvk je vynásoben 100
[item*100 for item in my_list]
[100, 200, 300, 400, 500, 600]
# Nový seznam obsahuje jenom ty prvky seznamu, které jsou sudé
[item for item in my_list if item%2 == 0]
[2, 4, 6]
# Obdobným způsobem vytvoříme seznam názvu sloupců,
# které obsahují 'Unnamed' (částečná shoda)
# To jsou sloupce, které chceme vyhodit
col_to_drop = [
col_name for col_name in unicef.columns if 'Unnamed' in col_name
]
col_to_drop
['Unnamed: 3', 'Unnamed: 5', 'Unnamed: 7', 'Unnamed: 9', 'Unnamed: 11', 'Unnamed: 13', 'Unnamed: 15', 'Unnamed: 17', 'Unnamed: 19', 'Unnamed: 21', 'Unnamed: 23']
# Odstraníme z tabulky seznam sloupců,
# které jsme vytvořili v předešlém kroku
unicef.drop(col_to_drop, axis='columns', inplace=True)
unicef.head()
ISO3 Code | Countries | Total | Male | Female | Urban | Rural | Poorest | Poorer | Middle | Fourth | Richest | Source | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AFG | Afghanistan | 36.3 | 26.9 | 46.8 | 19.7 | 41.4 | 40.6 | 46.2 | 44.4 | 31.8 | 16.7 | DHS 2015 |
1 | ALB | Albania | 7.60232 | 7.76783 | 7.41396 | 6.66002 | 8.25574 | 8.8999 | 6.45801 | 8.8039 | 6.7463 | 6.39511 | DHS 2008-09 |
2 | DZA | Algeria | 2.19084 | 2.1955 | 2.18592 | 1.94815 | 2.61943 | 3.4101 | 2.57707 | 1.7089 | 1.65181 | 1.42334 | MICS 2012-13 |
3 | AND | Andorra | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | AGO | Angola | 23.7 | 22.8 | 24.6 | 15.1 | 33.2 | 37.2 | 25.8 | 22.5 | 14.4 | 9.7 | IBEP 2008-09 |
primates = pd.read_csv('primates.csv')
# Zobrazíme názvy sloupců
primates.columns
Index(['Unnamed: 0', 'Bodywt', 'Brainwt'], dtype='object')
# inplace=True znamená, že se pozmění současná tabulka
primates.rename(
columns={'Unnamed: 0': 'Species'},
inplace=True
)
primates.head()
Species | Bodywt | Brainwt | |
---|---|---|---|
0 | Potar monkey | 10.0 | 115 |
1 | Gorilla | 207.0 | 406 |
2 | Human | 62.0 | 1320 |
3 | Rhesus monkey | 6.8 | 179 |
4 | Chimp | 52.2 | 440 |
titanic3.xls
pomocí read_excel
funkce z pandaspclass
na passenger_class
a home.dest
na home_destination
titanic = pd.read_excel('titanic3.xls')
titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
titanic.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'], dtype='object')
titanic.rename(
columns={
'pclass': 'passenger_class',
'home.dest':'home_destination'
},
inplace=True
)
titanic.head()
passenger_class | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home_destination | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
primates.dtypes
Species object Bodywt float64 Brainwt int64 dtype: object
titanic.dtypes
passenger_class int64 survived int64 name object sex object age float64 sibsp int64 parch int64 ticket object fare float64 cabin object embarked object boat object body float64 home_destination object dtype: object
# Počet řádků a počet sloupců
primates.shape
(5, 3)
# Počet řádků
primates.shape[0]
5
# Počet sloupců
primates.shape[1]
3
# Počet řádků
len(primates)
5
titanic3.xls
pomocí read_excel
funkce z pandastitanic.shape[0]
1309
titanic.shape[1]
14
# Načteme data z Excelu
titanic = pd.read_excel('titanic3.xls')
titanic.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
# Jaké sloupce má tabulka?
titanic.columns
Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket', 'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'], dtype='object')
# Má tabulká chybějící data?
titanic.isnull().values.any()
True
# Zjistíme, které sloupce mají chybějící data
titanic.isnull().any()
pclass False survived False name False sex False age True sibsp False parch False ticket False fare True cabin True embarked True boat True body True home.dest True dtype: bool
# Zobrazíme počet chybějících hodnot v každém sloupci
titanic.isnull().sum()
pclass 0 survived 0 name 0 sex 0 age 263 sibsp 0 parch 0 ticket 0 fare 1 cabin 1014 embarked 2 boat 823 body 1188 home.dest 564 dtype: int64
YouthRisk2009.csv
Bonus: Zjistěte kolík řádků má chybějící data (správná odpověď je 72)
youth_risk = pd.read_csv('YouthRisk2009.csv', index_col=0)
youth_risk.head()
Sleep | Sleep7 | SmokeLife | SmokeDaily | MarijuaEver | Age | |
---|---|---|---|---|---|---|
1 | NaN | NaN | No | NaN | 0.0 | 16 |
2 | 8 hours | 1.0 | Yes | Yes | 1.0 | 16 |
3 | 5 hours | 0.0 | Yes | Yes | 1.0 | 17 |
4 | 5 hours | 0.0 | Yes | Yes | 1.0 | 18 |
5 | 7 hours | 1.0 | Yes | No | 1.0 | 17 |
# Nastavíme maximální počet řádků, které se zobrazí při náhledu
pd.options.display.max_rows = 100
youth_risk.isnull().any(axis=1)
1 True 2 False 3 False 4 False 5 False 6 False 7 False 8 False 9 False 10 False 11 False 12 False 13 False 14 False 15 False 16 False 17 False 18 False 19 False 20 False 21 False 22 False 23 False 24 False 25 False 26 False 27 False 28 False 29 False 30 False 31 False 32 True 33 False 34 False 35 False 36 True 37 False 38 False 39 False 40 False 41 False 42 False 43 False 44 False 45 False 46 False 47 False 48 True 49 False 50 False ... 451 False 452 False 453 False 454 False 455 False 456 False 457 False 458 False 459 False 460 False 461 False 462 True 463 False 464 False 465 False 466 False 467 False 468 False 469 False 470 False 471 False 472 False 473 True 474 False 475 False 476 False 477 False 478 False 479 False 480 False 481 False 482 False 483 False 484 False 485 False 486 False 487 False 488 False 489 False 490 False 491 False 492 False 493 False 494 False 495 False 496 False 497 False 498 False 499 False 500 False Length: 500, dtype: bool
# Počet řádků s chybějícími hodnotami
youth_risk.isnull().any(axis=1).sum()
72
Pandas má 2 způsoby:
iloc
vybírá podle pozic (indexů)loc
vybírá podle jmen# Načteme data
primates = pd.read_csv('primates.csv')
primates.rename(index=str, columns={'Unnamed: 0': 'Species'}, inplace=True)
primates.head()
Species | Bodywt | Brainwt | |
---|---|---|---|
0 | Potar monkey | 10.0 | 115 |
1 | Gorilla | 207.0 | 406 |
2 | Human | 62.0 | 1320 |
3 | Rhesus monkey | 6.8 | 179 |
4 | Chimp | 52.2 | 440 |
# Vybereme 0. a 1. řádky
primates.iloc[0:2]
Species | Bodywt | Brainwt | |
---|---|---|---|
0 | Potar monkey | 10.0 | 115 |
1 | Gorilla | 207.0 | 406 |
# Vybereme 0. a 1. řádky a 1. a 2. sloupce
primates.iloc[0:2, [0,2]]
Species | Brainwt | |
---|---|---|
0 | Potar monkey | 115 |
1 | Gorilla | 406 |
# Vybíráme podle jména
# POZOR: číselný index musí být převeden řetězec
primates.loc[:'2', ['Species', 'Bodywt']]
Species | Bodywt | |
---|---|---|
0 | Potar monkey | 10.0 |
1 | Gorilla | 207.0 |
2 | Human | 62.0 |
# Vybereme sloupec
primates.Bodywt
0 10.0 1 207.0 2 62.0 3 6.8 4 52.2 Name: Bodywt, dtype: float64
# Vybereme sloupec - druhy způsob
primates['Bodywt']
0 10.0 1 207.0 2 62.0 3 6.8 4 52.2 Name: Bodywt, dtype: float64
# Lze specifikovat, který sloupec je index
# Index se v tomto případě skládá z řetězců
primates2 = pd.read_csv(
'primates.csv',
index_col=0
)
primates2.head()
Bodywt | Brainwt | |
---|---|---|
Potar monkey | 10.0 | 115 |
Gorilla | 207.0 | 406 |
Human | 62.0 | 1320 |
Rhesus monkey | 6.8 | 179 |
Chimp | 52.2 | 440 |
primates2.loc['Potar monkey']
Bodywt 10.0 Brainwt 115.0 Name: Potar monkey, dtype: float64
titanic
name
)Bonus: Vyberte řádky 10 až 550 a následující sloupce: jméno (name
), cena jízdenky (fare
), zda osoba přežila (survived
)
titanic.iloc[:20]
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
5 | 1 | 1 | Anderson, Mr. Harry | male | 48.0000 | 0 | 0 | 19952 | 26.5500 | E12 | S | 3 | NaN | New York, NY |
6 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0000 | 1 | 0 | 13502 | 77.9583 | D7 | S | 10 | NaN | Hudson, NY |
7 | 1 | 0 | Andrews, Mr. Thomas Jr | male | 39.0000 | 0 | 0 | 112050 | 0.0000 | A36 | S | NaN | NaN | Belfast, NI |
8 | 1 | 1 | Appleton, Mrs. Edward Dale (Charlotte Lamson) | female | 53.0000 | 2 | 0 | 11769 | 51.4792 | C101 | S | D | NaN | Bayside, Queens, NY |
9 | 1 | 0 | Artagaveytia, Mr. Ramon | male | 71.0000 | 0 | 0 | PC 17609 | 49.5042 | NaN | C | NaN | 22.0 | Montevideo, Uruguay |
10 | 1 | 0 | Astor, Col. John Jacob | male | 47.0000 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | NaN | 124.0 | New York, NY |
11 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.0000 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C | 4 | NaN | New York, NY |
12 | 1 | 1 | Aubart, Mme. Leontine Pauline | female | 24.0000 | 0 | 0 | PC 17477 | 69.3000 | B35 | C | 9 | NaN | Paris, France |
13 | 1 | 1 | Barber, Miss. Ellen "Nellie" | female | 26.0000 | 0 | 0 | 19877 | 78.8500 | NaN | S | 6 | NaN | NaN |
14 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0000 | 0 | 0 | 27042 | 30.0000 | A23 | S | B | NaN | Hessle, Yorks |
15 | 1 | 0 | Baumann, Mr. John D | male | NaN | 0 | 0 | PC 17318 | 25.9250 | NaN | S | NaN | NaN | New York, NY |
16 | 1 | 0 | Baxter, Mr. Quigg Edmond | male | 24.0000 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | NaN | NaN | Montreal, PQ |
17 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0000 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C | 6 | NaN | Montreal, PQ |
18 | 1 | 1 | Bazzani, Miss. Albina | female | 32.0000 | 0 | 0 | 11813 | 76.2917 | D15 | C | 8 | NaN | NaN |
19 | 1 | 0 | Beattie, Mr. Thomson | male | 36.0000 | 0 | 0 | 13050 | 75.2417 | C6 | C | A | NaN | Winnipeg, MN |
titanic.iloc[:20, 2]
0 Allen, Miss. Elisabeth Walton 1 Allison, Master. Hudson Trevor 2 Allison, Miss. Helen Loraine 3 Allison, Mr. Hudson Joshua Creighton 4 Allison, Mrs. Hudson J C (Bessie Waldo Daniels) 5 Anderson, Mr. Harry 6 Andrews, Miss. Kornelia Theodosia 7 Andrews, Mr. Thomas Jr 8 Appleton, Mrs. Edward Dale (Charlotte Lamson) 9 Artagaveytia, Mr. Ramon 10 Astor, Col. John Jacob 11 Astor, Mrs. John Jacob (Madeleine Talmadge Force) 12 Aubart, Mme. Leontine Pauline 13 Barber, Miss. Ellen "Nellie" 14 Barkworth, Mr. Algernon Henry Wilson 15 Baumann, Mr. John D 16 Baxter, Mr. Quigg Edmond 17 Baxter, Mrs. James (Helene DeLaudeniere Chaput) 18 Bazzani, Miss. Albina 19 Beattie, Mr. Thomson Name: name, dtype: object
titanic.iloc[10:551, [1,2,8]]
survived | name | fare | |
---|---|---|---|
10 | 0 | Astor, Col. John Jacob | 227.5250 |
11 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | 227.5250 |
12 | 1 | Aubart, Mme. Leontine Pauline | 69.3000 |
13 | 1 | Barber, Miss. Ellen "Nellie" | 78.8500 |
14 | 1 | Barkworth, Mr. Algernon Henry Wilson | 30.0000 |
15 | 0 | Baumann, Mr. John D | 25.9250 |
16 | 0 | Baxter, Mr. Quigg Edmond | 247.5208 |
17 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | 247.5208 |
18 | 1 | Bazzani, Miss. Albina | 76.2917 |
19 | 0 | Beattie, Mr. Thomson | 75.2417 |
20 | 1 | Beckwith, Mr. Richard Leonard | 52.5542 |
21 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 52.5542 |
22 | 1 | Behr, Mr. Karl Howell | 30.0000 |
23 | 1 | Bidois, Miss. Rosalie | 227.5250 |
24 | 1 | Bird, Miss. Ellen | 221.7792 |
25 | 0 | Birnbaum, Mr. Jakob | 26.0000 |
26 | 1 | Bishop, Mr. Dickinson H | 91.0792 |
27 | 1 | Bishop, Mrs. Dickinson H (Helen Walton) | 91.0792 |
28 | 1 | Bissette, Miss. Amelia | 135.6333 |
29 | 1 | Bjornstrom-Steffansson, Mr. Mauritz Hakan | 26.5500 |
30 | 0 | Blackwell, Mr. Stephen Weart | 35.5000 |
31 | 1 | Blank, Mr. Henry | 31.0000 |
32 | 1 | Bonnell, Miss. Caroline | 164.8667 |
33 | 1 | Bonnell, Miss. Elizabeth | 26.5500 |
34 | 0 | Borebank, Mr. John James | 26.5500 |
35 | 1 | Bowen, Miss. Grace Scott | 262.3750 |
36 | 1 | Bowerman, Miss. Elsie Edith | 55.0000 |
37 | 1 | Bradley, Mr. George ("George Arthur Brayton") | 26.5500 |
38 | 0 | Brady, Mr. John Bertram | 30.5000 |
39 | 0 | Brandeis, Mr. Emil | 50.4958 |
40 | 0 | Brewe, Dr. Arthur Jackson | 39.6000 |
41 | 1 | Brown, Mrs. James Joseph (Margaret Tobin) | 27.7208 |
42 | 1 | Brown, Mrs. John Murray (Caroline Lane Lamson) | 51.4792 |
43 | 1 | Bucknell, Mrs. William Robert (Emma Eliza Ward) | 76.2917 |
44 | 1 | Burns, Miss. Elizabeth Margaret | 134.5000 |
45 | 0 | Butt, Major. Archibald Willingham | 26.5500 |
46 | 0 | Cairns, Mr. Alexander | 31.0000 |
47 | 1 | Calderhead, Mr. Edward Pennington | 26.2875 |
48 | 1 | Candee, Mrs. Edward (Helen Churchill Hungerford) | 27.4458 |
49 | 1 | Cardeza, Mr. Thomas Drake Martinez | 512.3292 |
50 | 1 | Cardeza, Mrs. James Warburton Martinez (Charlo... | 512.3292 |
51 | 0 | Carlsson, Mr. Frans Olof | 5.0000 |
52 | 0 | Carrau, Mr. Francisco M | 47.1000 |
53 | 0 | Carrau, Mr. Jose Pedro | 47.1000 |
54 | 1 | Carter, Master. William Thornton II | 120.0000 |
55 | 1 | Carter, Miss. Lucile Polk | 120.0000 |
56 | 1 | Carter, Mr. William Ernest | 120.0000 |
57 | 1 | Carter, Mrs. William Ernest (Lucile Polk) | 120.0000 |
58 | 0 | Case, Mr. Howard Brown | 26.0000 |
59 | 1 | Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... | 27.7208 |
... | ... | ... | ... |
501 | 1 | Mellinger, Miss. Madeleine Violet | 19.5000 |
502 | 1 | Mellinger, Mrs. (Elizabeth Anne Maidment) | 19.5000 |
503 | 1 | Mellors, Mr. William John | 10.5000 |
504 | 0 | Meyer, Mr. August | 13.0000 |
505 | 0 | Milling, Mr. Jacob Christian | 13.0000 |
506 | 0 | Mitchell, Mr. Henry Michael | 10.5000 |
507 | 0 | Montvila, Rev. Juozas | 13.0000 |
508 | 0 | Moraweck, Dr. Ernest | 14.0000 |
509 | 0 | Morley, Mr. Henry Samuel ("Mr Henry Marshall") | 26.0000 |
510 | 0 | Mudd, Mr. Thomas Charles | 10.5000 |
511 | 0 | Myles, Mr. Thomas Francis | 9.6875 |
512 | 0 | Nasser, Mr. Nicholas | 30.0708 |
513 | 1 | Nasser, Mrs. Nicholas (Adele Achem) | 30.0708 |
514 | 1 | Navratil, Master. Edmond Roger | 26.0000 |
515 | 1 | Navratil, Master. Michel M | 26.0000 |
516 | 0 | Navratil, Mr. Michel ("Louis M Hoffman") | 26.0000 |
517 | 0 | Nesson, Mr. Israel | 13.0000 |
518 | 0 | Nicholls, Mr. Joseph Charles | 36.7500 |
519 | 0 | Norman, Mr. Robert Douglas | 13.5000 |
520 | 1 | Nourney, Mr. Alfred ("Baron von Drachstedt") | 13.8625 |
521 | 1 | Nye, Mrs. (Elizabeth Ramell) | 10.5000 |
522 | 0 | Otter, Mr. Richard | 13.0000 |
523 | 1 | Oxenham, Mr. Percy Thomas | 10.5000 |
524 | 1 | Padro y Manent, Mr. Julian | 13.8625 |
525 | 0 | Pain, Dr. Alfred | 10.5000 |
526 | 1 | Pallas y Castello, Mr. Emilio | 13.8583 |
527 | 0 | Parker, Mr. Clifford Richard | 10.5000 |
528 | 0 | Parkes, Mr. Francis "Frank" | 0.0000 |
529 | 1 | Parrish, Mrs. (Lutie Davis) | 26.0000 |
530 | 0 | Pengelly, Mr. Frederick William | 10.5000 |
531 | 0 | Pernot, Mr. Rene | 15.0500 |
532 | 0 | Peruschitz, Rev. Joseph Maria | 13.0000 |
533 | 1 | Phillips, Miss. Alice Frances Louisa | 21.0000 |
534 | 1 | Phillips, Miss. Kate Florence ("Mrs Kate Louis... | 26.0000 |
535 | 0 | Phillips, Mr. Escott Robert | 21.0000 |
536 | 1 | Pinsky, Mrs. (Rosa) | 13.0000 |
537 | 0 | Ponesell, Mr. Martin | 13.0000 |
538 | 1 | Portaluppi, Mr. Emilio Ilario Giuseppe | 12.7375 |
539 | 0 | Pulbaum, Mr. Franz | 15.0333 |
540 | 1 | Quick, Miss. Phyllis May | 26.0000 |
541 | 1 | Quick, Miss. Winifred Vera | 26.0000 |
542 | 1 | Quick, Mrs. Frederick Charles (Jane Richards) | 26.0000 |
543 | 0 | Reeves, Mr. David | 10.5000 |
544 | 0 | Renouf, Mr. Peter Henry | 21.0000 |
545 | 1 | Renouf, Mrs. Peter Henry (Lillian Jefferys) | 21.0000 |
546 | 1 | Reynaldo, Ms. Encarnacion | 13.0000 |
547 | 0 | Richard, Mr. Emile | 15.0458 |
548 | 1 | Richards, Master. George Sibley | 18.7500 |
549 | 1 | Richards, Master. William Rowe | 18.7500 |
550 | 1 | Richards, Mrs. Sidney (Emily Hocking) | 18.7500 |
541 rows × 3 columns
titanic.loc[10:550, ['name', 'fare', 'survived']]
name | fare | survived | |
---|---|---|---|
10 | Astor, Col. John Jacob | 227.5250 | 0 |
11 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | 227.5250 | 1 |
12 | Aubart, Mme. Leontine Pauline | 69.3000 | 1 |
13 | Barber, Miss. Ellen "Nellie" | 78.8500 | 1 |
14 | Barkworth, Mr. Algernon Henry Wilson | 30.0000 | 1 |
15 | Baumann, Mr. John D | 25.9250 | 0 |
16 | Baxter, Mr. Quigg Edmond | 247.5208 | 0 |
17 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | 247.5208 | 1 |
18 | Bazzani, Miss. Albina | 76.2917 | 1 |
19 | Beattie, Mr. Thomson | 75.2417 | 0 |
20 | Beckwith, Mr. Richard Leonard | 52.5542 | 1 |
21 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | 52.5542 | 1 |
22 | Behr, Mr. Karl Howell | 30.0000 | 1 |
23 | Bidois, Miss. Rosalie | 227.5250 | 1 |
24 | Bird, Miss. Ellen | 221.7792 | 1 |
25 | Birnbaum, Mr. Jakob | 26.0000 | 0 |
26 | Bishop, Mr. Dickinson H | 91.0792 | 1 |
27 | Bishop, Mrs. Dickinson H (Helen Walton) | 91.0792 | 1 |
28 | Bissette, Miss. Amelia | 135.6333 | 1 |
29 | Bjornstrom-Steffansson, Mr. Mauritz Hakan | 26.5500 | 1 |
30 | Blackwell, Mr. Stephen Weart | 35.5000 | 0 |
31 | Blank, Mr. Henry | 31.0000 | 1 |
32 | Bonnell, Miss. Caroline | 164.8667 | 1 |
33 | Bonnell, Miss. Elizabeth | 26.5500 | 1 |
34 | Borebank, Mr. John James | 26.5500 | 0 |
35 | Bowen, Miss. Grace Scott | 262.3750 | 1 |
36 | Bowerman, Miss. Elsie Edith | 55.0000 | 1 |
37 | Bradley, Mr. George ("George Arthur Brayton") | 26.5500 | 1 |
38 | Brady, Mr. John Bertram | 30.5000 | 0 |
39 | Brandeis, Mr. Emil | 50.4958 | 0 |
40 | Brewe, Dr. Arthur Jackson | 39.6000 | 0 |
41 | Brown, Mrs. James Joseph (Margaret Tobin) | 27.7208 | 1 |
42 | Brown, Mrs. John Murray (Caroline Lane Lamson) | 51.4792 | 1 |
43 | Bucknell, Mrs. William Robert (Emma Eliza Ward) | 76.2917 | 1 |
44 | Burns, Miss. Elizabeth Margaret | 134.5000 | 1 |
45 | Butt, Major. Archibald Willingham | 26.5500 | 0 |
46 | Cairns, Mr. Alexander | 31.0000 | 0 |
47 | Calderhead, Mr. Edward Pennington | 26.2875 | 1 |
48 | Candee, Mrs. Edward (Helen Churchill Hungerford) | 27.4458 | 1 |
49 | Cardeza, Mr. Thomas Drake Martinez | 512.3292 | 1 |
50 | Cardeza, Mrs. James Warburton Martinez (Charlo... | 512.3292 | 1 |
51 | Carlsson, Mr. Frans Olof | 5.0000 | 0 |
52 | Carrau, Mr. Francisco M | 47.1000 | 0 |
53 | Carrau, Mr. Jose Pedro | 47.1000 | 0 |
54 | Carter, Master. William Thornton II | 120.0000 | 1 |
55 | Carter, Miss. Lucile Polk | 120.0000 | 1 |
56 | Carter, Mr. William Ernest | 120.0000 | 1 |
57 | Carter, Mrs. William Ernest (Lucile Polk) | 120.0000 | 1 |
58 | Case, Mr. Howard Brown | 26.0000 | 0 |
59 | Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev... | 27.7208 | 1 |
... | ... | ... | ... |
501 | Mellinger, Miss. Madeleine Violet | 19.5000 | 1 |
502 | Mellinger, Mrs. (Elizabeth Anne Maidment) | 19.5000 | 1 |
503 | Mellors, Mr. William John | 10.5000 | 1 |
504 | Meyer, Mr. August | 13.0000 | 0 |
505 | Milling, Mr. Jacob Christian | 13.0000 | 0 |
506 | Mitchell, Mr. Henry Michael | 10.5000 | 0 |
507 | Montvila, Rev. Juozas | 13.0000 | 0 |
508 | Moraweck, Dr. Ernest | 14.0000 | 0 |
509 | Morley, Mr. Henry Samuel ("Mr Henry Marshall") | 26.0000 | 0 |
510 | Mudd, Mr. Thomas Charles | 10.5000 | 0 |
511 | Myles, Mr. Thomas Francis | 9.6875 | 0 |
512 | Nasser, Mr. Nicholas | 30.0708 | 0 |
513 | Nasser, Mrs. Nicholas (Adele Achem) | 30.0708 | 1 |
514 | Navratil, Master. Edmond Roger | 26.0000 | 1 |
515 | Navratil, Master. Michel M | 26.0000 | 1 |
516 | Navratil, Mr. Michel ("Louis M Hoffman") | 26.0000 | 0 |
517 | Nesson, Mr. Israel | 13.0000 | 0 |
518 | Nicholls, Mr. Joseph Charles | 36.7500 | 0 |
519 | Norman, Mr. Robert Douglas | 13.5000 | 0 |
520 | Nourney, Mr. Alfred ("Baron von Drachstedt") | 13.8625 | 1 |
521 | Nye, Mrs. (Elizabeth Ramell) | 10.5000 | 1 |
522 | Otter, Mr. Richard | 13.0000 | 0 |
523 | Oxenham, Mr. Percy Thomas | 10.5000 | 1 |
524 | Padro y Manent, Mr. Julian | 13.8625 | 1 |
525 | Pain, Dr. Alfred | 10.5000 | 0 |
526 | Pallas y Castello, Mr. Emilio | 13.8583 | 1 |
527 | Parker, Mr. Clifford Richard | 10.5000 | 0 |
528 | Parkes, Mr. Francis "Frank" | 0.0000 | 0 |
529 | Parrish, Mrs. (Lutie Davis) | 26.0000 | 1 |
530 | Pengelly, Mr. Frederick William | 10.5000 | 0 |
531 | Pernot, Mr. Rene | 15.0500 | 0 |
532 | Peruschitz, Rev. Joseph Maria | 13.0000 | 0 |
533 | Phillips, Miss. Alice Frances Louisa | 21.0000 | 1 |
534 | Phillips, Miss. Kate Florence ("Mrs Kate Louis... | 26.0000 | 1 |
535 | Phillips, Mr. Escott Robert | 21.0000 | 0 |
536 | Pinsky, Mrs. (Rosa) | 13.0000 | 1 |
537 | Ponesell, Mr. Martin | 13.0000 | 0 |
538 | Portaluppi, Mr. Emilio Ilario Giuseppe | 12.7375 | 1 |
539 | Pulbaum, Mr. Franz | 15.0333 | 0 |
540 | Quick, Miss. Phyllis May | 26.0000 | 1 |
541 | Quick, Miss. Winifred Vera | 26.0000 | 1 |
542 | Quick, Mrs. Frederick Charles (Jane Richards) | 26.0000 | 1 |
543 | Reeves, Mr. David | 10.5000 | 0 |
544 | Renouf, Mr. Peter Henry | 21.0000 | 0 |
545 | Renouf, Mrs. Peter Henry (Lillian Jefferys) | 21.0000 | 1 |
546 | Reynaldo, Ms. Encarnacion | 13.0000 | 1 |
547 | Richard, Mr. Emile | 15.0458 | 0 |
548 | Richards, Master. George Sibley | 18.7500 | 1 |
549 | Richards, Master. William Rowe | 18.7500 | 1 |
550 | Richards, Mrs. Sidney (Emily Hocking) | 18.7500 | 1 |
541 rows × 3 columns
Primáti:
primates
Species | Bodywt | Brainwt | |
---|---|---|---|
0 | Potar monkey | 10.0 | 115 |
1 | Gorilla | 207.0 | 406 |
2 | Human | 62.0 | 1320 |
3 | Rhesus monkey | 6.8 | 179 |
4 | Chimp | 52.2 | 440 |
brainwt_kg = primates.Brainwt / 1000
brainwt_kg
0 0.115 1 0.406 2 1.320 3 0.179 4 0.440 Name: Brainwt, dtype: float64
brainwt_kg / primates.Bodywt
0 0.011500 1 0.001961 2 0.021290 3 0.026324 4 0.008429 dtype: float64
# První způsob vytvoření nového sloupce
primates['ratio'] = brainwt_kg / primates.Bodywt
primates
Species | Bodywt | Brainwt | ratio | |
---|---|---|---|---|
0 | Potar monkey | 10.0 | 115 | 0.011500 |
1 | Gorilla | 207.0 | 406 | 0.001961 |
2 | Human | 62.0 | 1320 | 0.021290 |
3 | Rhesus monkey | 6.8 | 179 | 0.026324 |
4 | Chimp | 52.2 | 440 | 0.008429 |
# První způsob vytvoření nového sloupce
primates.assign(ratio2=(brainwt_kg / primates.Bodywt), inplace=True)
primates
Species | Bodywt | Brainwt | ratio | |
---|---|---|---|---|
0 | Potar monkey | 10.0 | 115 | 0.011500 |
1 | Gorilla | 207.0 | 406 | 0.001961 |
2 | Human | 62.0 | 1320 | 0.021290 |
3 | Rhesus monkey | 6.8 | 179 | 0.026324 |
4 | Chimp | 52.2 | 440 | 0.008429 |
# Nevektorové oprerace - iterujeme řádky tabulky
for row_index, row in primates.iterrows():
print(row.Bodywt)
10.0 207.0 62.0 6.8 52.2
primates2
Bodywt | Brainwt | |
---|---|---|
Potar monkey | 10.0 | 115 |
Gorilla | 207.0 | 406 |
Human | 62.0 | 1320 |
Rhesus monkey | 6.8 | 179 |
Chimp | 52.2 | 440 |
primates2.columns
Index(['Bodywt', 'Brainwt'], dtype='object')
primates2.Bodywt
Potar monkey 10.0 Gorilla 207.0 Human 62.0 Rhesus monkey 6.8 Chimp 52.2 Name: Bodywt, dtype: float64
primates2.Bodywt > 60
Potar monkey False Gorilla True Human True Rhesus monkey False Chimp False Name: Bodywt, dtype: bool
# ~ znamená negace
~(primates2.Brainwt < 500)
Potar monkey False Gorilla False Human True Rhesus monkey False Chimp False Name: Brainwt, dtype: bool
# | znamená nebo
primates2[(primates2.Bodywt > 60) | (primates2.Brainwt >= 500)]
Bodywt | Brainwt | |
---|---|---|
Gorilla | 207.0 | 406 |
Human | 62.0 | 1320 |
# Vyberéme řádky, kde tělesná hmota je více než 60
primates2[primates2.Bodywt > 60]
Bodywt | Brainwt | |
---|---|---|
Gorilla | 207.0 | 406 |
Human | 62.0 | 1320 |
# Vyberéme řádky, kde je tělesná hmota více než 60 a
# hmota mozku je méně než 500
primates2[(primates2.Bodywt > 60) & (primates2.Brainwt < 500)]
Bodywt | Brainwt | |
---|---|---|
Gorilla | 207.0 | 406 |
titanic
Bonus: Pro vybrané řádky spočítejte průměrnou cenu jízdenky (fare
) (správná odpověď je cca 51)
sum(titanic.sex == 'female')
466
len(titanic[titanic.sex == 'female'])
466
titanic[titanic.sex == 'female'].shape[0]
466
sum((titanic.age > 1) & (titanic.sex == 'female'))
379
titanic[(titanic.age > 1) & (titanic.sex == 'female')].fare.mean()
51.017481530343005
Použijeme metodu groupby
, která seskupí řádky dle jednoho nebo více sloupců.
Potom použijeme nějakou metodu nad seskupenými řádky (count
, mean
, sum
).
# Načteme data z Excelu
data = pd.read_excel('titanic3.xls')
data.head()
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0000 | 0 | 0 | 24160 | 211.3375 | B5 | S | 2 | NaN | St Louis, MO |
1 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.9167 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | 11 | NaN | Montreal, PQ / Chesterville, ON |
2 | 1 | 0 | Allison, Miss. Helen Loraine | female | 2.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
3 | 1 | 0 | Allison, Mr. Hudson Joshua Creighton | male | 30.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | 135.0 | Montreal, PQ / Chesterville, ON |
4 | 1 | 0 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.0000 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S | NaN | NaN | Montreal, PQ / Chesterville, ON |
# Groupby vrací iterátor
groups = titanic.groupby('sex')
# Unikátní hodnoty dle kterých je rozdělena tabulka
groups.groups.keys()
dict_keys(['female', 'male'])
# Seskupíme řádky dle pohlaví a spočítáme průměrnou hodnotu všech číselných proměnných
data.groupby('sex').mean()
pclass | survived | age | sibsp | parch | fare | body | |
---|---|---|---|---|---|---|---|
sex | |||||||
female | 2.154506 | 0.727468 | 28.687071 | 0.652361 | 0.633047 | 46.198097 | 166.62500 |
male | 2.372479 | 0.190985 | 30.585233 | 0.413998 | 0.247924 | 26.154601 | 160.39823 |
# Totéž můžeme udělat pomocí filtrování
titanic[titanic.sex == 'female'].mean()
pclass 2.154506 survived 0.727468 age 28.687071 sibsp 0.652361 parch 0.633047 fare 46.198097 body 166.625000 dtype: float64
titanic[titanic.sex == 'male'].mean()
pclass 2.372479 survived 0.190985 age 30.585233 sibsp 0.413998 parch 0.247924 fare 26.154601 body 160.398230 dtype: float64
# Ve smyčce vytiskneme průměrné hodnoty pro cestující obou pohlaví
for sex in titanic.sex.unique():
print(sex, titanic[titanic.sex == sex].mean())
female pclass 2.154506 survived 0.727468 age 28.687071 sibsp 0.652361 parch 0.633047 fare 46.198097 body 166.625000 dtype: float64 male pclass 2.372479 survived 0.190985 age 30.585233 sibsp 0.413998 parch 0.247924 fare 26.154601 body 160.398230 dtype: float64
# Seskupení dle více proměnných - pohlaví a místa nalodění
titanic.groupby(['sex', 'embarked', 'survived']).mean()
pclass | age | sibsp | parch | fare | body | |||
---|---|---|---|---|---|---|---|---|
sex | embarked | survived | ||||||
female | C | 0 | 2.636364 | 27.071429 | 0.454545 | 0.454545 | 17.460991 | 328.000000 |
1 | 1.539216 | 31.543956 | 0.509804 | 0.519608 | 87.994648 | NaN | ||
Q | 0 | 3.000000 | 26.692308 | 0.086957 | 0.347826 | 9.311961 | 229.000000 | |
1 | 2.837838 | 24.000000 | 0.270270 | 0.000000 | 14.564192 | NaN | ||
S | 0 | 2.806452 | 24.842105 | 1.333333 | 1.075269 | 26.117341 | 79.500000 | |
1 | 1.924242 | 29.104167 | 0.560606 | 0.651515 | 45.549622 | NaN | ||
male | C | 0 | 2.119266 | 35.178082 | 0.302752 | 0.201835 | 42.555965 | 155.083333 |
1 | 1.729167 | 29.912602 | 0.375000 | 0.416667 | 63.013896 | NaN | ||
Q | 0 | 2.875000 | 32.104167 | 0.500000 | 0.107143 | 12.561384 | 134.750000 | |
1 | 3.000000 | 25.000000 | 0.285714 | 0.000000 | 9.974400 | NaN | ||
S | 0 | 2.460348 | 30.855829 | 0.444874 | 0.241779 | 20.722284 | 163.105882 | |
1 | 2.188679 | 25.712862 | 0.358491 | 0.339623 | 27.291978 | NaN |
Seskupte titanic
dle třídy (pclass
) a pohlaví (sex
) zobrazte medián (median
)
Bonus: zobrazte jenom sloupec věk (age
)
titanic.groupby(['pclass', 'sex']).median()
survived | age | sibsp | parch | fare | body | ||
---|---|---|---|---|---|---|---|
pclass | sex | ||||||
1 | female | 1 | 36.0 | 0 | 0 | 80.92915 | NaN |
male | 0 | 42.0 | 0 | 0 | 49.50420 | 166.0 | |
2 | female | 1 | 28.0 | 0 | 0 | 23.00000 | 52.0 |
male | 0 | 29.5 | 0 | 0 | 13.00000 | 160.0 | |
3 | female | 0 | 22.0 | 0 | 0 | 10.48960 | 206.0 |
male | 0 | 25.0 | 0 | 0 | 7.89580 | 148.0 |
# Pokud seskupojeme podle 2 a více proměnných, výsledná tabulka
# bude mít několikaúrovňový index
titanic.groupby(['pclass', 'sex']).median()['age'].index
MultiIndex(levels=[[1, 2, 3], ['female', 'male']], labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]], names=['pclass', 'sex'])
titanic.groupby(['pclass',]).median()['age'].index
Int64Index([1, 2, 3], dtype='int64', name='pclass')
titanic.groupby(['pclass', 'sex']).median().age
pclass sex 1 female 36.0 male 42.0 2 female 28.0 male 29.5 3 female 22.0 male 25.0 Name: age, dtype: float64
# Bez include='all' jenom numerické sloupce
titanic.describe(include='all')
pclass | survived | name | sex | age | sibsp | parch | ticket | fare | cabin | embarked | boat | body | home.dest | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1309.000000 | 1309.000000 | 1309 | 1309 | 1046.000000 | 1309.000000 | 1309.000000 | 1309 | 1308.000000 | 295 | 1307 | 486 | 121.000000 | 745 |
unique | NaN | NaN | 1307 | 2 | NaN | NaN | NaN | 939 | NaN | 186 | 3 | 28 | NaN | 369 |
top | NaN | NaN | Connolly, Miss. Kate | male | NaN | NaN | NaN | CA. 2343 | NaN | C23 C25 C27 | S | 13 | NaN | New York, NY |
freq | NaN | NaN | 2 | 843 | NaN | NaN | NaN | 11 | NaN | 6 | 914 | 39 | NaN | 64 |
mean | 2.294882 | 0.381971 | NaN | NaN | 29.881135 | 0.498854 | 0.385027 | NaN | 33.295479 | NaN | NaN | NaN | 160.809917 | NaN |
std | 0.837836 | 0.486055 | NaN | NaN | 14.413500 | 1.041658 | 0.865560 | NaN | 51.758668 | NaN | NaN | NaN | 97.696922 | NaN |
min | 1.000000 | 0.000000 | NaN | NaN | 0.166700 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN | NaN | 1.000000 | NaN |
25% | 2.000000 | 0.000000 | NaN | NaN | 21.000000 | 0.000000 | 0.000000 | NaN | 7.895800 | NaN | NaN | NaN | 72.000000 | NaN |
50% | 3.000000 | 0.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN | NaN | 155.000000 | NaN |
75% | 3.000000 | 1.000000 | NaN | NaN | 39.000000 | 1.000000 | 0.000000 | NaN | 31.275000 | NaN | NaN | NaN | 256.000000 | NaN |
max | 3.000000 | 1.000000 | NaN | NaN | 80.000000 | 8.000000 | 9.000000 | NaN | 512.329200 | NaN | NaN | NaN | 328.000000 | NaN |
# Bez include='all' jenom numerické sloupce
titanic.describe()
pclass | survived | age | sibsp | parch | fare | body | |
---|---|---|---|---|---|---|---|
count | 1309.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 | 121.000000 |
mean | 2.294882 | 0.381971 | 29.881135 | 0.498854 | 0.385027 | 33.295479 | 160.809917 |
std | 0.837836 | 0.486055 | 14.413500 | 1.041658 | 0.865560 | 51.758668 | 97.696922 |
min | 1.000000 | 0.000000 | 0.166700 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 2.000000 | 0.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 | 72.000000 |
50% | 3.000000 | 0.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 155.000000 |
75% | 3.000000 | 1.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 | 256.000000 |
max | 3.000000 | 1.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 | 328.000000 |
Použijte následující postup a seznamte se s těmito daty, popis dat
Načíst data
V jakém formátu jsou sloupce?
Kolik je řádků a sloupců?
Obsahují některé sloupce chybějící data?
Udělat popisnou statistiku dat
education = pd.read_csv(
'http://vincentarelbundock.github.io/Rdatasets/csv/robustbase/education.csv',
index_col=0
)
education.head()
State | Region | X1 | X2 | X3 | Y | |
---|---|---|---|---|---|---|
1 | ME | 1 | 508 | 3944 | 325 | 235 |
2 | NH | 1 | 564 | 4578 | 323 | 231 |
3 | VT | 1 | 322 | 4011 | 328 | 270 |
4 | MA | 1 | 846 | 5233 | 305 | 261 |
5 | RI | 1 | 871 | 4780 | 303 | 300 |
education.rename(columns={
'X1': 'residents', 'X2': 'income',
'X3': 'residents_young', 'Y': 'expenditure'
},
inplace=True
)
education.head()
State | Region | residents | income | residents_young | expenditure | |
---|---|---|---|---|---|---|
1 | ME | 1 | 508 | 3944 | 325 | 235 |
2 | NH | 1 | 564 | 4578 | 323 | 231 |
3 | VT | 1 | 322 | 4011 | 328 | 270 |
4 | MA | 1 | 846 | 5233 | 305 | 261 |
5 | RI | 1 | 871 | 4780 | 303 | 300 |
education.dtypes
State object Region int64 residents int64 income int64 residents_young int64 expenditure int64 dtype: object
education.shape
(50, 6)
education.isnull().any()
State False Region False residents False income False residents_young False expenditure False dtype: bool
education.describe(include='all')
State | Region | residents | income | residents_young | expenditure | |
---|---|---|---|---|---|---|
count | 50 | 50.000000 | 50.000000 | 50.000000 | 50.000000 | 50.000000 |
unique | 50 | NaN | NaN | NaN | NaN | NaN |
top | CO | NaN | NaN | NaN | NaN | NaN |
freq | 1 | NaN | NaN | NaN | NaN | NaN |
mean | NaN | 2.660000 | 657.800000 | 4675.120000 | 325.740000 | 284.600000 |
std | NaN | 1.061574 | 145.016396 | 644.506254 | 19.423119 | 61.340136 |
min | NaN | 1.000000 | 322.000000 | 3448.000000 | 287.000000 | 208.000000 |
25% | NaN | 2.000000 | 546.750000 | 4137.250000 | 310.750000 | 234.250000 |
50% | NaN | 3.000000 | 662.500000 | 4706.000000 | 324.500000 | 269.500000 |
75% | NaN | 3.750000 | 782.250000 | 5054.250000 | 333.000000 | 316.750000 |
max | NaN | 4.000000 | 909.000000 | 5889.000000 | 386.000000 | 546.000000 |