How to Use Pandas GroupBy, Counts and Value Counts

Alex DeBrie

July 18, 2019

Pandas groupby
Understanding your data’s shape with Pandas count and value_counts
- Pandas value_counts method
Conclusion

If you’re a data scientist, you likely spend a lot of time cleaning and manipulating data for use in your applications. One of the core libraries for preparing data is the Pandas library for Python.

In a previous post, we explored the background of Pandas and the basic usage of a Pandas DataFrame, the core data structure in Pandas. Check out that post if you want to get up to speed with the basics of Pandas.

In this post, we’ll explore a few of the core methods on Pandas DataFrames. These methods help you segment and review your DataFrames during your analysis.

We’ll cover

Using Pandas groupby to segment your DataFrame into groups.
Exploring your Pandas DataFrame with counts and value_counts.

Let’s get started.

Pandas `groupby`

Pandas is typically used for exploring and organizing large volumes of tabular data, like a super-powered Excel spreadsheet. Often, you’ll want to organize a pandas DataFrame into subgroups for further analysis.

For example, perhaps you have stock ticker data in a DataFrame, as we explored in the last post. Your Pandas DataFrame might look as follows:

>>> df

          date symbol     open     high      low    close    volume

0   2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73   4974877

1   2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17   6167358

2   2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43   3681522

3   2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95   3996001

4   2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95   4957017

5   2019-03-01   AAPL   174.28   175.15   172.89   174.97  25886167

6   2019-03-04   AAPL   175.69   177.75   173.97   175.85  27436203

7   2019-03-05   AAPL   175.94   176.00   174.54   175.53  19737419

8   2019-03-06   AAPL   174.67   175.49   173.94   174.52  20810384

9   2019-03-07   AAPL   173.87   174.44   172.02   172.50  24796374

10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99   1450316

11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80   1446047

12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03   1443174

13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86   1099289

14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30   1166559

Perhaps we want to analyze this stock information on a symbol-by-symbol basis rather than combining Amazon (“AMZN”) data with Google (“GOOG”) data or that of Apple (“AAPL”).

This is where the Pandas groupby method is useful. You can use groupby to chunk up your data into subsets for further analysis.

Basic Pandas `groupby` usage

Let’s do some basic usage of groupby to see how it’s helpful.

In your Python interpreter, enter the following commands:

>>> import pandas as pd

>>> import numpy as np

>>> url = 'https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv'

>>> df = pd.read_csv(url)

>>> df

          date symbol     open     high      low    close    volume

0   2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73   4974877

1   2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17   6167358

2   2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43   3681522

3   2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95   3996001

4   2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95   4957017

5   2019-03-01   AAPL   174.28   175.15   172.89   174.97  25886167

6   2019-03-04   AAPL   175.69   177.75   173.97   175.85  27436203

7   2019-03-05   AAPL   175.94   176.00   174.54   175.53  19737419

8   2019-03-06   AAPL   174.67   175.49   173.94   174.52  20810384

9   2019-03-07   AAPL   173.87   174.44   172.02   172.50  24796374

10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99   1450316

11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80   1446047

12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03   1443174

13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86   1099289

14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30   1166559

In the steps above, we’re importing the Pandas and NumPy libraries, then setting up a basic DataFrame by downloading CSV data from a URL. We print our DataFrame to the console to see what we have.

Now, let’s group our DataFrame using the stock symbol. The easiest and most common way to use groupby is by passing one or more column names. For our example, we’ll use “symbol” as the column name for grouping:

>>> symbols = df.groupby('symbol')

>>> print(symbols.groups)

{'AAPL': Int64Index([5, 6, 7, 8, 9], dtype='int64'),

 'AMZN': Int64Index([0, 1, 2, 3, 4], dtype='int64'),

 'GOOG': Int64Index([10, 11, 12, 13, 14], dtype='int64')}

Interpreting the output from the printed groups can be a little hard to understand. In the output above, it’s showing that we have three groups: AAPL, AMZN, and GOOG. For each group, it includes an index to the rows in the original DataFrame that belong to each group.

The input to groupby is quite flexible. You can choose to group by multiple columns. For example, if we had a year column available, we could group by both stock symbol and year to perform year-over-year analysis on our stock data.

Want to Code Faster?

Kite is a plugin for PyCharm, Atom, Vim, VSCode, Sublime Text, and IntelliJ that uses machine learning to provide you with code completions in real time sorted by relevance. Start coding faster today.

Download Kite Free

Using a custom function in Pandas `groupby`

In the previous example, we passed a column name to the groupby method. You can also pass your own function to the groupby method. This function will receive an index number for each row in the DataFrame and should return a value that will be used for grouping. This can provide significant flexibility for grouping rows using complex logic.

As an example, imagine we want to group our rows depending on whether the stock price increased on that particular day. We would use the following:

>>> def increased(idx):

...     return df.loc[idx].close > df.loc[idx].open

...

>>> df.groupby(increased).groups

{False: Int64Index([2, 3, 4, 7, 8, 9, 13, 14], dtype='int64'),

 True: Int64Index([0, 1, 5, 6, 10, 11, 12], dtype='int64')}

First, we would define a function called increased, which receives an index. It returns True if the close value for that row in the DataFrame is higher than the open value; otherwise, it returns False.

When we pass that function into the groupby() method, our DataFrame is grouped into two groups based on whether the stock’s closing price was higher than the opening price on the given day.

Operating on Pandas groups

After you’ve created your groups using the groupby function, you can perform some handy data manipulation on the resulting groups.

In our example above, we created groups of our stock tickers by symbol. Let’s now find the mean trading volume for each symbol.

>>> symbols['volume'].agg(np.mean)

symbol

AAPL    23733309.4

AMZN     4755355.0

GOOG     1321077.0

Name: volume, dtype: float64

To complete this task, you specify the column on which you want to operate—volume—then use Pandas’ agg method to apply NumPy’s mean function. The result is the mean volume for each of the three symbols. From this, we can see that AAPL’s trading volume is an order of magnitude larger than AMZN and GOOG’s trading volume.

Iteration and selecting groups

Iteration is a core programming pattern, and few languages have nicer syntax for iteration than Python. Python’s built-in list comprehensions and generators make iteration a breeze.

Pandas groupby is no different, as it provides excellent support for iteration. You can loop over the groupby result object using a for loop:

>>> for symbol, group in symbols:

...     print(symbol)

...     print(group)

...

AAPL

         date symbol    open    high     low   close    volume

5  2019-03-01   AAPL  174.28  175.15  172.89  174.97  25886167

6  2019-03-04   AAPL  175.69  177.75  173.97  175.85  27436203

7  2019-03-05   AAPL  175.94  176.00  174.54  175.53  19737419

8  2019-03-06   AAPL  174.67  175.49  173.94  174.52  20810384

9  2019-03-07   AAPL  173.87  174.44  172.02  172.50  24796374

AMZN

         date symbol     open     high      low    close   volume

0  2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73  4974877

1  2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17  6167358

2  2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43  3681522

3  2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95  3996001

4  2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95  4957017

GOOG

          date symbol     open     high      low    close   volume

10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99  1450316

11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80  1446047

12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03  1443174

13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86  1099289

14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30  1166559

Each iteration on the groupby object will return two values. The first value is the identifier of the group, which is the value for the column(s) on which they were grouped. The second value is the group itself, which is a Pandas DataFrame object.

Pandas get_group method

If you want more flexibility to manipulate a single group, you can use the get_group method to retrieve a single group.

>>> aapl = symbols.get_group('AAPL')
>>> aapl
         date symbol    open    high     low   close    volume
5  2019-03-01   AAPL  174.28  175.15  172.89  174.97  25886167
6  2019-03-04   AAPL  175.69  177.75  173.97  175.85  27436203
7  2019-03-05   AAPL  175.94  176.00  174.54  175.53  19737419
8  2019-03-06   AAPL  174.67  175.49  173.94  174.52  20810384
9  2019-03-07   AAPL  173.87  174.44  172.02  172.50  24796374
>>> type(aapl)
<class 'pandas.core.frame.DataFrame'>

In the example above, we use the Pandas get_group method to retrieve all AAPL rows. To retrieve a particular group, you pass the identifier of the group into the get_group method. This method returns a Pandas DataFrame, which we can manipulate as needed.

Want to Code Faster?

Download Kite Free

Understanding your data’s shape with Pandas `count` and `value_counts`

If you’re working with a large DataFrame, you’ll need to use various heuristics for understanding the shape of your data. In this section, we’ll look at Pandas count and value_counts, two methods for evaluating your DataFrame.

The count method will show you the number of values for each column in your DataFrame. Using our DataFrame from above, we get the following output:

>>> df.count()

date      15

symbol    15

open      15

high      15

low       15

close     15

volume    15

dtype: int64

The output isn’t particularly helpful for us, as each of our 15 rows has a value for every column. However, this can be very useful where your data set is missing a large number of values. Using the count method can help to identify columns that are incomplete. From there, you can decide whether to exclude the columns from your processing or to provide default values where necessary.

Pandas `value_counts` method

For our case, value_counts method is more useful. This method will return the number of unique values for a particular column. If you have continuous variables, like our columns, you can provide an optional “bins” argument to separate the values into half-open bins.

Let’s use the Pandas value_counts method to view the shape of our volume column.

>>> df['volume'].value_counts(bins=4)

(1072952.085, 7683517.5]    10

(20851974.5, 27436203.0]     3

(14267746.0, 20851974.5]     2

(7683517.5, 14267746.0]      0

Name: volume, dtype: int64

In the output above, Pandas has created four separate bins for our volume column and shows us the number of rows that land in each bin.

Both counts() and value_counts() are great utilities for quickly understanding the shape of your data.

Conclusion

In this post, we learned about groupby, count, and value_counts – three of the main methods in Pandas.

Pandas is a powerful tool for manipulating data once you know the core operations and how to use it. New to Pandas or Python? Download Kite to supercharge your workflow. Kite provides line-of-code completions while you’re typing for faster development, as well as examples of how others are using the same methods.

Head over to our YouTube channel for more Python tips, tutorials and project breakdowns like this one:

View Article Comments

How to Use Pandas GroupBy, Counts and Value Counts

Table of Contents

Pandas `groupby`

Basic Pandas `groupby` usage

Using a custom function in Pandas `groupby`

Operating on Pandas groups

Iteration and selecting groups

Pandas get_group method

Understanding your data’s shape with Pandas `count` and `value_counts`

Pandas `value_counts` method

Conclusion

Head over to our YouTube channel for more Python tips, tutorials and project breakdowns like this one:

Company

Product

Resources

Stay in touch

How to Use Pandas GroupBy, Counts and Value Counts

Table of Contents

Pandas groupby

Basic Pandas groupby usage

Using a custom function in Pandas groupby

Operating on Pandas groups

Iteration and selecting groups

Pandas get_group method

Understanding your data’s shape with Pandas count and value_counts

Pandas value_counts method

Conclusion

Head over to our YouTube channel for more Python tips, tutorials and project breakdowns like this one:

Company

Product

Resources

Stay in touch

Pandas `groupby`

Basic Pandas `groupby` usage

Using a custom function in Pandas `groupby`

Understanding your data’s shape with Pandas `count` and `value_counts`

Pandas `value_counts` method