퀴즈 전체 중 일부만 풀이한 내용이다. 모르는 문제에는 주석을 첨부하여 함수가 갖고 있는 기능을 적어두었다. 기본 이상의 pandas 지식을 묻는 문제들은 답안지를 보며 코드 작성 방법을 배우고, 함수들의 기능을 구글링을 통해 배우는 등의 공부를 진행했다.
https://github.com/rougier/numpy-100
100 pandas puzzles
Inspired by 100 Numpy exerises, here are 100* short puzzles for testing your knowledge of pandas' power.
Since pandas is a large library with many different specialist features and functions, these excercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects.
Many of the excerises here are stright-forward in that the solutions require no more than a few lines of code (in pandas or NumPy... don't go using pure Python or Cython!). Choosing the right methods and following best practices is the underlying goal.
The exercises are loosely divided in sections. Each section has a difficulty rating; these ratings are subjective, of course, but should be a seen as a rough guide as to how inventive the required solution is.
If you're just starting out with pandas and you are looking for some other resources, the official documentation is very extensive. In particular, some good places get a broader overview of pandas are...
Enjoy the puzzles!
Importing pandas
Getting started and checking your pandas setup
Difficulty: easy
1. Import pandas under the alias pd
.
import pandas as pd
2. Print the version of pandas that has been imported.
pd.__version__
'1.3.4'
3. Print out all the version information of the libraries that are required by the pandas library.
pd.show_versions()
INSTALLED VERSIONS
------------------
commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python : 3.9.7.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22000
machine : AMD64
processor : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : Korean_Korea.949
pandas : 1.3.4
numpy : 1.20.3
pytz : 2021.3
dateutil : 2.8.2
pip : 21.2.4
setuptools : 58.0.4
Cython : 0.29.24
pytest : 6.2.4
hypothesis : None
sphinx : 4.2.0
blosc : None
feather : None
xlsxwriter : 3.0.1
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.3
IPython : 7.29.0
pandas_datareader: None
bs4 : 4.10.0
bottleneck : 1.3.2
fsspec : 2021.10.1
fastparquet : None
gcsfs : None
matplotlib : 3.4.3
numexpr : 2.7.3
odfpy : None
openpyxl : 3.0.9
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.7.1
sqlalchemy : 1.4.22
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 2.0.1
xlwt : 1.3.0
numba : 0.54.1
DataFrame basics
A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames
Difficulty: easy
Note: remember to import numpy using:
import numpy as np
Consider the following Python dictionary data
and Python list labels
:
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
(This is just some meaningless data I made up with the theme of animals and trips to a vet.)
4. Create a DataFrame df
from this dictionary data
which has the index labels
.
import numpy as np
data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}
labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']
df = pd.DataFrame(data, index=labels)
df
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 2.0 | 3 | no |
g | snake | 4.5 | 1 | no |
h | cat | NaN | 1 | yes |
i | dog | 7.0 | 2 | no |
j | dog | 3.0 | 1 | no |
5. Display a summary of the basic information about this DataFrame and its data (hint: there is a single method that can be called on the DataFrame).
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 animal 10 non-null object
1 age 8 non-null float64
2 visits 10 non-null int64
3 priority 10 non-null object
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes
6. Return the first 3 rows of the DataFrame df
.
df.head(3)
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
7. Select just the 'animal' and 'age' columns from the DataFrame df
.
df.loc[:, ['age','animal']]
age | animal | |
---|---|---|
a | 2.5 | cat |
b | 3.0 | cat |
c | 0.5 | snake |
d | NaN | dog |
e | 5.0 | dog |
f | 2.0 | cat |
g | 4.5 | snake |
h | NaN | cat |
i | 7.0 | dog |
j | 3.0 | dog |
8. Select the data in rows [3, 4, 8]
and in columns ['animal', 'age']
.
df.loc[df.index[[3,4,8]], ['animal', 'age']]
animal | age | |
---|---|---|
d | dog | NaN |
e | dog | 5.0 |
i | dog | 7.0 |
9. Select only the rows where the number of visits is greater than 3.
df[df['visits'] > 3]
animal | age | visits | priority |
---|
10. Select the rows where the age is missing, i.e. it is NaN
.
df[df['age'].isnull()]
animal | age | visits | priority | |
---|---|---|---|---|
d | dog | NaN | 3 | yes |
h | cat | NaN | 1 | yes |
11. Select the rows where the animal is a cat and the age is less than 3.
df[(df['animal'] == 'cat') & (df['age'] < 3)]
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
f | cat | 2.0 | 3 | no |
12. Select the rows the age is between 2 and 4 (inclusive).
df[df['age'].between(2, 4)]
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
f | cat | 2.0 | 3 | no |
j | dog | 3.0 | 1 | no |
df[(df['age'] >= 2) & (df['age'] < 4)]
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
f | cat | 2.0 | 3 | no |
j | dog | 3.0 | 1 | no |
13. Change the age in row 'f' to 1.5.
df.loc['f', 'age'] = 1.5
14. Calculate the sum of all visits in df
(i.e. find the total number of visits).
df['visits'].sum()
19
15. Calculate the mean age for each different animal in df
.
df.groupby('animal')['age'].mean()
animal
cat 2.333333
dog 5.000000
snake 2.500000
Name: age, dtype: float64
16. Append a new row 'k' to df
with your choice of values for each column. Then delete that row to return the original DataFrame.
df.loc['k'] = ['dog', 1.5, '0', 'yes']
df
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 1.5 | 3 | no |
g | snake | 4.5 | 1 | no |
h | cat | NaN | 1 | yes |
i | dog | 7.0 | 2 | no |
j | dog | 3.0 | 1 | no |
k | dog | 1.5 | 0 | yes |
df = df.drop('k')
df
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | yes |
b | cat | 3.0 | 3 | yes |
c | snake | 0.5 | 2 | no |
d | dog | NaN | 3 | yes |
e | dog | 5.0 | 2 | no |
f | cat | 1.5 | 3 | no |
g | snake | 4.5 | 1 | no |
h | cat | NaN | 1 | yes |
i | dog | 7.0 | 2 | no |
j | dog | 3.0 | 1 | no |
17. Count the number of each type of animal in df
.
df['animal'].value_counts()
cat 4
dog 4
snake 2
Name: animal, dtype: int64
18. Sort df
first by the values in the 'age' in decending order, then by the value in the 'visits' column in ascending order (so row i
should be first, and row d
should be last).
df.sort_values(by=['age', 'visits'], ascending=[False, True])
animal | age | visits | priority | |
---|---|---|---|---|
i | dog | 7.0 | 2 | no |
e | dog | 5.0 | 2 | no |
g | snake | 4.5 | 1 | no |
j | dog | 3.0 | 1 | no |
b | cat | 3.0 | 3 | yes |
a | cat | 2.5 | 1 | yes |
f | cat | 1.5 | 3 | no |
c | snake | 0.5 | 2 | no |
h | cat | NaN | 1 | yes |
d | dog | NaN | 3 | yes |
19. The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True
and 'no' should be False
.
df['priority'] = df['priority'].map({'yes': True, 'no': False})
df
# Syntax: Series.map(arg, na_action=None).
# Return type: Pandas Series with the same as an index as a caller.
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | True |
b | cat | 3.0 | 3 | True |
c | snake | 0.5 | 2 | False |
d | dog | NaN | 3 | True |
e | dog | 5.0 | 2 | False |
f | cat | 1.5 | 3 | False |
g | snake | 4.5 | 1 | False |
h | cat | NaN | 1 | True |
i | dog | 7.0 | 2 | False |
j | dog | 3.0 | 1 | False |
20. In the 'animal' column, change the 'snake' entries to 'python'.
df['animal'] = df['animal'].replace('snake', 'python')
df
animal | age | visits | priority | |
---|---|---|---|---|
a | cat | 2.5 | 1 | True |
b | cat | 3.0 | 3 | True |
c | python | 0.5 | 2 | False |
d | dog | NaN | 3 | True |
e | dog | 5.0 | 2 | False |
f | cat | 1.5 | 3 | False |
g | python | 4.5 | 1 | False |
h | cat | NaN | 1 | True |
i | dog | 7.0 | 2 | False |
j | dog | 3.0 | 1 | False |
21. For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).
df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')
visits | 1 | 2 | 3 |
---|---|---|---|
animal | |||
cat | 2.5 | NaN | 2.25 |
dog | 3.0 | 6.0 | NaN |
python | 4.5 | 0.5 | NaN |
DataFrames: beyond the basics
Slightly trickier: you may need to combine two or more methods to get the right answer
Difficulty: medium
The previous section was tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.
22. You have a DataFrame df
with a column 'A' of integers. For example:
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
How do you filter out rows which contain the same integer as the row immediately above?
You should be left with a column containing the following values:
1, 2, 3, 4, 5, 6, 7
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df.drop_duplicates(subset='A')
# drop_duplicates( ) : 중복되는 행 제거
A | |
---|---|
0 | 1 |
1 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
8 | 6 |
9 | 7 |
df.loc[df['A'].shift() != df['A']]
A | |
---|---|
0 | 1 |
1 | 2 |
3 | 3 |
4 | 4 |
5 | 5 |
8 | 6 |
9 | 7 |
23. Given a DataFrame of numeric values, say
df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values
how do you subtract the row mean from each element in the row?
df = pd.DataFrame(np.random.random(size=(5, 3)))
df.sub(df.mean(axis=1), axis=0)
# sub함수는 DataFrame에 다른 데이터프레임이나, Series, 스칼라 등 데이터를 빼는 메서드다.
# DataFrame.sub(other, axis='columns', level=None, fill_value=None)
# other : 데이터프레임이나, Series, 스칼라 등 데이터가 올 수 있습니다. 뺄 값입니다.
# axis : 뺄 레이블을 설정합니다. 0은 행(index), 1은 열 입니다. ※Series일 경우 Index와 일치시킬 축
# level : multiIndex에서 계산할 Index의 레벨입니다.
# fill_value : NaN 값등의 누락 요소를 계산 전에 이 값으로 대체합니다.
0 | 1 | 2 | |
---|---|---|---|
0 | -0.121079 | -0.273447 | 0.394527 |
1 | 0.196543 | 0.016145 | -0.212688 |
2 | 0.027762 | 0.115100 | -0.142862 |
3 | -0.246590 | 0.524881 | -0.278291 |
4 | -0.348375 | 0.384489 | -0.036114 |
24. Suppose you have DataFrame with 10 columns of real numbers, for example:
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
Which column of numbers has the smallest sum? Return that column's label.
df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))
df.sum().idxmin()
#idxmin() : 최소값을 가지는 인덱스 레이블 출력
'a'
25. How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)? As input, use a DataFrame of zeros and ones with 10 rows and 3 columns.
df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
len(df.drop_duplicates(keep=False))
# keep=False : 처음이나 끝값인지 여부는 고려를 안하고 중복이면 무조건 True를 반환
2
The next three puzzles are slightly harder.
26. In the cell below, you have a DataFrame df
that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values.
For each row of the DataFrame, find the column which contains the third NaN value.
You should return a Series of column labels: e, c, d, h, d
nan = np.nan
data = [[0.04, nan, nan, 0.25, nan, 0.43, 0.71, 0.51, nan, nan],
[ nan, nan, nan, 0.04, 0.76, nan, nan, 0.67, 0.76, 0.16],
[ nan, nan, 0.5 , nan, 0.31, 0.4 , nan, nan, 0.24, 0.01],
[0.49, nan, nan, 0.62, 0.73, 0.26, 0.85, nan, nan, nan],
[ nan, nan, 0.41, nan, 0.05, nan, 0.61, nan, 0.48, 0.68]]
columns = list('abcdefghij')
df = pd.DataFrame(data, columns=columns)
# write a solution to the question here
(df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)
# cumsum : 행/열의 누적합
0 e
1 c
2 d
3 h
4 d
dtype: object
27. A DataFrame has a column of groups 'grps' and and column of integer values 'vals':
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
For each group, find the sum of the three greatest values. You should end up with the answer as follows:
grps
a 409
b 156
c 345
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'),
'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
# write a solution to the question here
df.groupby('grps')['vals'].nlargest(3).sum(level=0)
grps
a 409
b 156
c 345
Name: vals, dtype: int64
28. The DataFrame df
constructed below has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive).
For each group of 10 consecutive integers in 'A' (i.e. (0, 10]
, (10, 20]
, ...), calculate the sum of the corresponding values in column 'B'.
The answer should be a Series as follows:
A
(0, 10] 635
(10, 20] 360
(20, 30] 315
(30, 40] 306
(40, 50] 750
(50, 60] 284
(60, 70] 424
(70, 80] 526
(80, 90] 835
(90, 100] 852
df = pd.DataFrame(np.random.RandomState(8765).randint(1, 101, size=(100, 2)), columns = ["A", "B"])
# write a solution to the question here
df.groupby(pd.cut(df['A'], np.arange(0, 101, 10)))['B'].sum()
# pd.cut() : 같은 갯수로 구간 나누기
# numpy.arange([start, ] stop, [step, ] dtype=None)
# np.arange()는 반열린구간 [start, stop) 에서 step 의 크기만큼 일정하게 떨어져 있는 숫자들을 array 형태로 반환해 주는 함수
# stop 매개변수의 값은 반드시 O
# start, step 은 꼭 전달되지 않아도 됨.
# start 값이 전달되지 않았다면 0이 기본값, step 값이 전달되지 않았다면 1을 기본값
A
(0, 10] 635
(10, 20] 360
(20, 30] 315
(30, 40] 306
(40, 50] 750
(50, 60] 284
(60, 70] 424
(70, 80] 526
(80, 90] 835
(90, 100] 852
Name: B, dtype: int32
'python' 카테고리의 다른 글
[python] 소상공인시장진흥공단 상가업소정보로 프랜차이즈 입점분석 (4) (0) | 2023.01.29 |
---|---|
[python] 소상공인시장진흥공단 상가업소정보로 프랜차이즈 입점분석 (3) (0) | 2023.01.28 |
[python] jupyter notebook에서 생성한 folium map 티스토리로 불러오기 (2) | 2023.01.26 |
[python] 소상공인시장진흥공단 상가업소정보로 프랜차이즈 입점분석 (2) (1) | 2023.01.25 |
[python] 소상공인시장진흥공단 상가업소정보로 프랜차이즈 입점분석 (1) (0) | 2023.01.25 |
댓글