Importing pandas
- Getting started and checking your pandas setup
DataFrame basics
- A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames
DataFrames: beyond the basics
- Slightly trickier: you may need to combine two or more methods to get the right answer
  - 'Data Analysis > python' 카테고리의 다른 글

퀴즈 전체 중 일부만 풀이한 내용이다. 모르는 문제에는 주석을 첨부하여 함수가 갖고 있는 기능을 적어두었다. 기본 이상의 pandas 지식을 묻는 문제들은 답안지를 보며 코드 작성 방법을 배우고, 함수들의 기능을 구글링을 통해 배우는 등의 공부를 진행했다.

[Python 완전정복 시리즈] 2편 : Pandas DataFrame 완전정복

안녕하세요! 파이썬 완전정복 시리즈의 저자 김태준 입니다. 파이썬 완전 정복 시리즈의 궁극적인 목표는 자신만의 알고리즘 트레이딩 프로그램 만들기 입니다. 본 도서는 독자…

wikidocs.net

https://github.com/rougier/numpy-100

GitHub - rougier/numpy-100: 100 numpy exercises (with solutions)

100 numpy exercises (with solutions). Contribute to rougier/numpy-100 development by creating an account on GitHub.

github.com

100 pandas puzzles

Inspired by 100 Numpy exerises, here are 100* short puzzles for testing your knowledge of pandas' power.

Since pandas is a large library with many different specialist features and functions, these excercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects.

Many of the excerises here are stright-forward in that the solutions require no more than a few lines of code (in pandas or NumPy... don't go using pure Python or Cython!). Choosing the right methods and following best practices is the underlying goal.

The exercises are loosely divided in sections. Each section has a difficulty rating; these ratings are subjective, of course, but should be a seen as a rough guide as to how inventive the required solution is.

If you're just starting out with pandas and you are looking for some other resources, the official documentation is very extensive. In particular, some good places get a broader overview of pandas are...

Enjoy the puzzles!

Importing pandas

Getting started and checking your pandas setup

Difficulty: easy

1. Import pandas under the alias pd.

In [1]:

import pandas as pd

2. Print the version of pandas that has been imported.

In [2]:

pd.__version__

Out[2]:

'1.3.4'

3. Print out all the version information of the libraries that are required by the pandas library.

In [3]:

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5
python           : 3.9.7.final.0
python-bits      : 64
OS               : Windows
OS-release       : 10
Version          : 10.0.22000
machine          : AMD64
processor        : Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : Korean_Korea.949

pandas           : 1.3.4
numpy            : 1.20.3
pytz             : 2021.3
dateutil         : 2.8.2
pip              : 21.2.4
setuptools       : 58.0.4
Cython           : 0.29.24
pytest           : 6.2.4
hypothesis       : None
sphinx           : 4.2.0
blosc            : None
feather          : None
xlsxwriter       : 3.0.1
lxml.etree       : 4.6.3
html5lib         : 1.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.3
IPython          : 7.29.0
pandas_datareader: None
bs4              : 4.10.0
bottleneck       : 1.3.2
fsspec           : 2021.10.1
fastparquet      : None
gcsfs            : None
matplotlib       : 3.4.3
numexpr          : 2.7.3
odfpy            : None
openpyxl         : 3.0.9
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.7.1
sqlalchemy       : 1.4.22
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 2.0.1
xlwt             : 1.3.0
numba            : 0.54.1

DataFrame basics

A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames

Difficulty: easy

Note: remember to import numpy using:

import numpy as np

Consider the following Python dictionary data and Python list labels:

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

(This is just some meaningless data I made up with the theme of animals and trips to a vet.)

4. Create a DataFrame df from this dictionary data which has the index labels.

In [4]:

import numpy as np

data = {'animal': ['cat', 'cat', 'snake', 'dog', 'dog', 'cat', 'snake', 'cat', 'dog', 'dog'],
        'age': [2.5, 3, 0.5, np.nan, 5, 2, 4.5, np.nan, 7, 3],
        'visits': [1, 3, 2, 3, 2, 3, 1, 1, 2, 1],
        'priority': ['yes', 'yes', 'no', 'yes', 'no', 'no', 'no', 'yes', 'no', 'no']}

labels = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j']

df = pd.DataFrame(data, index=labels)
df

Out[4]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	2.0	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

5. Display a summary of the basic information about this DataFrame and its data (hint: there is a single method that can be called on the DataFrame).

In [5]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   animal    10 non-null     object 
 1   age       8 non-null      float64
 2   visits    10 non-null     int64  
 3   priority  10 non-null     object 
dtypes: float64(1), int64(1), object(2)
memory usage: 400.0+ bytes

6. Return the first 3 rows of the DataFrame df.

In [6]:

df.head(3)

Out[6]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no

7. Select just the 'animal' and 'age' columns from the DataFrame df.

In [7]:

df.loc[:, ['age','animal']]

Out[7]:

	age	animal
a	2.5	cat
b	3.0	cat
c	0.5	snake
d	NaN	dog
e	5.0	dog
f	2.0	cat
g	4.5	snake
h	NaN	cat
i	7.0	dog
j	3.0	dog

8. Select the data in rows [3, 4, 8] and in columns ['animal', 'age'].

In [8]:

df.loc[df.index[[3,4,8]], ['animal', 'age']]

Out[8]:

	animal	age
d	dog	NaN
e	dog	5.0
i	dog	7.0

9. Select only the rows where the number of visits is greater than 3.

In [9]:

df[df['visits'] > 3]

Out[9]:

	animal	age	visits	priority

10. Select the rows where the age is missing, i.e. it is NaN.

In [10]:

df[df['age'].isnull()]

Out[10]:

	animal	age	visits	priority
d	dog	NaN	3	yes
h	cat	NaN	1	yes

11. Select the rows where the animal is a cat and the age is less than 3.

In [11]:

df[(df['animal'] == 'cat') & (df['age'] < 3)]

Out[11]:

	animal	age	visits	priority
a	cat	2.5	1	yes
f	cat	2.0	3	no

12. Select the rows the age is between 2 and 4 (inclusive).

In [12]:

df[df['age'].between(2, 4)]

Out[12]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
f	cat	2.0	3	no
j	dog	3.0	1	no

In [13]:

df[(df['age'] >= 2) & (df['age'] < 4)]

Out[13]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
f	cat	2.0	3	no
j	dog	3.0	1	no

13. Change the age in row 'f' to 1.5.

In [14]:

df.loc['f', 'age'] = 1.5

14. Calculate the sum of all visits in df (i.e. find the total number of visits).

In [15]:

df['visits'].sum()

Out[15]:

15. Calculate the mean age for each different animal in df.

In [16]:

df.groupby('animal')['age'].mean()

Out[16]:

animal
cat      2.333333
dog      5.000000
snake    2.500000
Name: age, dtype: float64

16. Append a new row 'k' to df with your choice of values for each column. Then delete that row to return the original DataFrame.

In [17]:

df.loc['k'] =  ['dog', 1.5, '0', 'yes']
df

Out[17]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	1.5	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no
k	dog	1.5	0	yes

In [18]:

df = df.drop('k')
df

Out[18]:

	animal	age	visits	priority
a	cat	2.5	1	yes
b	cat	3.0	3	yes
c	snake	0.5	2	no
d	dog	NaN	3	yes
e	dog	5.0	2	no
f	cat	1.5	3	no
g	snake	4.5	1	no
h	cat	NaN	1	yes
i	dog	7.0	2	no
j	dog	3.0	1	no

17. Count the number of each type of animal in df.

In [19]:

df['animal'].value_counts()

Out[19]:

cat      4
dog      4
snake    2
Name: animal, dtype: int64

18. Sort df first by the values in the 'age' in decending order, then by the value in the 'visits' column in ascending order (so row i should be first, and row d should be last).

In [20]:

df.sort_values(by=['age', 'visits'], ascending=[False, True])

Out[20]:

	animal	age	visits	priority
i	dog	7.0	2	no
e	dog	5.0	2	no
g	snake	4.5	1	no
j	dog	3.0	1	no
b	cat	3.0	3	yes
a	cat	2.5	1	yes
f	cat	1.5	3	no
c	snake	0.5	2	no
h	cat	NaN	1	yes
d	dog	NaN	3	yes

19. The 'priority' column contains the values 'yes' and 'no'. Replace this column with a column of boolean values: 'yes' should be True and 'no' should be False.

In [21]:

df['priority'] = df['priority'].map({'yes': True, 'no': False})
df

# Syntax: Series.map(arg, na_action=None). 
# Return type: Pandas Series with the same as an index as a caller. 

Out[21]:

	animal	age	visits	priority
a	cat	2.5	1	True
b	cat	3.0	3	True
c	snake	0.5	2	False
d	dog	NaN	3	True
e	dog	5.0	2	False
f	cat	1.5	3	False
g	snake	4.5	1	False
h	cat	NaN	1	True
i	dog	7.0	2	False
j	dog	3.0	1	False

20. In the 'animal' column, change the 'snake' entries to 'python'.

In [22]:

df['animal'] = df['animal'].replace('snake', 'python')
df

Out[22]:

	animal	age	visits	priority
a	cat	2.5	1	True
b	cat	3.0	3	True
c	python	0.5	2	False
d	dog	NaN	3	True
e	dog	5.0	2	False
f	cat	1.5	3	False
g	python	4.5	1	False
h	cat	NaN	1	True
i	dog	7.0	2	False
j	dog	3.0	1	False

21. For each animal type and each number of visits, find the mean age. In other words, each row is an animal, each column is a number of visits and the values are the mean ages (hint: use a pivot table).

In [23]:

df.pivot_table(index='animal', columns='visits', values='age', aggfunc='mean')

Out[23]:

visits	1	2	3
animal
cat	2.5	NaN	2.25
dog	3.0	6.0	NaN
python	4.5	0.5	NaN

DataFrames: beyond the basics

Slightly trickier: you may need to combine two or more methods to get the right answer

Difficulty: medium

The previous section was tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.

22. You have a DataFrame df with a column 'A' of integers. For example:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

How do you filter out rows which contain the same integer as the row immediately above?

You should be left with a column containing the following values:

1, 2, 3, 4, 5, 6, 7

In [24]:

df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})

df.drop_duplicates(subset='A')

# drop_duplicates( ) : 중복되는 행 제거

Out[24]:

	A
0	1
1	2
3	3
4	4
5	5
8	6
9	7

In [25]:

df.loc[df['A'].shift() != df['A']]

Out[25]:

	A
0	1
1	2
3	3
4	4
5	5
8	6
9	7

23. Given a DataFrame of numeric values, say

df = pd.DataFrame(np.random.random(size=(5, 3))) # a 5x3 frame of float values

how do you subtract the row mean from each element in the row?

In [26]:

df = pd.DataFrame(np.random.random(size=(5, 3)))

df.sub(df.mean(axis=1), axis=0)

# sub함수는 DataFrame에 다른 데이터프레임이나, Series, 스칼라 등 데이터를 빼는 메서드다.
# DataFrame.sub(other, axis='columns', level=None, fill_value=None)
# other : 데이터프레임이나, Series, 스칼라 등 데이터가 올 수 있습니다. 뺄 값입니다.
# axis : 뺄 레이블을 설정합니다. 0은 행(index), 1은 열 입니다. ※Series일 경우 Index와 일치시킬 축
# level : multiIndex에서 계산할 Index의 레벨입니다.
# fill_value : NaN 값등의 누락 요소를 계산 전에 이 값으로 대체합니다.

Out[26]:

	0	1	2
0	-0.121079	-0.273447	0.394527
1	0.196543	0.016145	-0.212688
2	0.027762	0.115100	-0.142862
3	-0.246590	0.524881	-0.278291
4	-0.348375	0.384489	-0.036114

24. Suppose you have DataFrame with 10 columns of real numbers, for example:

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))

Which column of numbers has the smallest sum? Return that column's label.

In [27]:

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list('abcdefghij'))

df.sum().idxmin()

#idxmin() : 최소값을 가지는 인덱스 레이블 출력

Out[27]:

'a'

25. How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)? As input, use a DataFrame of zeros and ones with 10 rows and 3 columns.

df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))

In [28]:

df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))

len(df.drop_duplicates(keep=False))

# keep=False : 처음이나 끝값인지 여부는 고려를 안하고 중복이면 무조건 True를 반환

Out[28]:

The next three puzzles are slightly harder.

26. In the cell below, you have a DataFrame df that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values.

For each row of the DataFrame, find the column which contains the third NaN value.

You should return a Series of column labels: e, c, d, h, d

In [29]:

nan = np.nan

data = [[0.04,  nan,  nan, 0.25,  nan, 0.43, 0.71, 0.51,  nan,  nan],
        [ nan,  nan,  nan, 0.04, 0.76,  nan,  nan, 0.67, 0.76, 0.16],
        [ nan,  nan, 0.5 ,  nan, 0.31, 0.4 ,  nan,  nan, 0.24, 0.01],
        [0.49,  nan,  nan, 0.62, 0.73, 0.26, 0.85,  nan,  nan,  nan],
        [ nan,  nan, 0.41,  nan, 0.05,  nan, 0.61,  nan, 0.48, 0.68]]

columns = list('abcdefghij')

df = pd.DataFrame(data, columns=columns)

# write a solution to the question here
(df.isnull().cumsum(axis=1) == 3).idxmax(axis=1)

# cumsum : 행/열의 누적합

Out[29]:

0    e
1    c
2    d
3    h
4    d
dtype: object

27. A DataFrame has a column of groups 'grps' and and column of integer values 'vals':

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

For each group, find the sum of the three greatest values. You should end up with the answer as follows:

grps
a    409
b    156
c    345

In [30]:

df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})

# write a solution to the question here
df.groupby('grps')['vals'].nlargest(3).sum(level=0)

Out[30]:

grps
a    409
b    156
c    345
Name: vals, dtype: int64

28. The DataFrame df constructed below has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive).

For each group of 10 consecutive integers in 'A' (i.e. (0, 10], (10, 20], ...), calculate the sum of the corresponding values in column 'B'.

The answer should be a Series as follows:

A
(0, 10]      635
(10, 20]     360
(20, 30]     315
(30, 40]     306
(40, 50]     750
(50, 60]     284
(60, 70]     424
(70, 80]     526
(80, 90]     835
(90, 100]    852

In [31]:

df = pd.DataFrame(np.random.RandomState(8765).randint(1, 101, size=(100, 2)), columns = ["A", "B"])

# write a solution to the question here
df.groupby(pd.cut(df['A'], np.arange(0, 101, 10)))['B'].sum()

# pd.cut() : 같은 갯수로 구간 나누기
# numpy.arange([start, ] stop, [step, ] dtype=None)
# np.arange()는 반열린구간 [start, stop) 에서 step 의 크기만큼 일정하게 떨어져 있는 숫자들을 array 형태로 반환해 주는 함수
# stop 매개변수의 값은 반드시 O
# start, step 은 꼭 전달되지 않아도 됨.
# start 값이 전달되지 않았다면 0이 기본값, step 값이 전달되지 않았다면 1을 기본값

Out[31]:

A
(0, 10]      635
(10, 20]     360
(20, 30]     315
(30, 40]     306
(40, 50]     750
(50, 60]     284
(60, 70]     424
(70, 80]     526
(80, 90]     835
(90, 100]    852
Name: B, dtype: int32

'Data Analysis > python' 카테고리의 다른 글

[python] 소상공인시장진흥공단 상가업소정보로 프랜차이즈 입점분석 (4) (0)	2023.01.29
[python] 소상공인시장진흥공단 상가업소정보로 프랜차이즈 입점분석 (3) (0)	2023.01.28
[python] jupyter notebook에서 생성한 folium map 티스토리로 불러오기 (2)	2023.01.26
[python] 소상공인시장진흥공단 상가업소정보로 프랜차이즈 입점분석 (2) (1)	2023.01.25
[python] 소상공인시장진흥공단 상가업소정보로 프랜차이즈 입점분석 (1) (0)	2023.01.25

내 블로그 - 관리자 홈 전환	`Q` `Q`
새 글 쓰기	`W` `W`

글 수정 (권한 있는 경우)	`E` `E`
댓글 영역으로 이동	`C` `C`

이 페이지의 URL 복사	`S` `S`
맨 위로 이동	`T` `T`
티스토리 홈 이동	`H` `H`
단축키 안내	`Shift` + `/` `⇧` + `/`

통계먼지

[python] 100 pandas puzzles

100 pandas puzzles

Importing pandas

Getting started and checking your pandas setup

DataFrame basics

A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames

DataFrames: beyond the basics

Slightly trickier: you may need to combine two or more methods to get the right answer

'Data Analysis > python' 카테고리의 다른 글

댓글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역

[python] 100 pandas puzzles

100 pandas puzzles

Importing pandas

Getting started and checking your pandas setup

DataFrame basics

A few of the fundamental routines for selecting, sorting, adding and aggregating data in DataFrames

DataFrames: beyond the basics

Slightly trickier: you may need to combine two or more methods to get the right answer

'Data Analysis > python' 카테고리의 다른 글

관련글

댓글

티스토리툴바

단축키

내 블로그

블로그 게시글

모든 영역