Pandas Merge, Join, and Concat: How To and Examples

Alex DeBrie

May 3, 2019

When to use the Pandas concat vs. merge and join
How to use Pandas’ concat method
Using Pandas’ merge and join to combine DataFrames
- Using a left join with Pandas join method
- Using an inner join with Pandas join method
Conclusion

Python’s Pandas library is a popular library for cleaning, manipulating, and interpreting large amounts of data. In an earlier post, we looked at the foundational structure in Pandas—the DataFrame. If you’re unfamiliar with Pandas DataFrames, take a look at that post to understand the basics.

In this post, we’ll learn how to combine multiple DataFrames using Pandas merge, join, and concat. These methods let you supercharge your data by gluing together data from different sources.

We’ll cover:

When to use concat and when to use merge or join;
How to use Pandas’ concat method;
How to use Pandas’ merge and join.

Let’s get started.

When to use the Pandas `concat` vs. `merge` and `join`

While merge, join, and concat all work to combine multiple DataFrames, they are used for very different things. In this section, we’ll learn when you will want to use one operation over another. The key distinction is whether you want to combine your DataFrames horizontally or vertically.

The concat method allows you to combine DataFrames vertically.

Imagine you had two DataFrames with the same columns. Perhaps the first DataFrame includes 10 rows of stock trading data for one stock while the second DataFrame includes 10 rows of stock trading data for a different stock. A vertical combination would use a DataFrame’s concat method to combine the two DataFrames into a single DataFrame with twenty rows.

Notice that in a vertical combination with concat, the number of rows has increased but the number of columns has stayed the same.

By contrast, the merge and join methods help to combine DataFrames horizontally.

Imagine you have two DataFrames. The first contains stock trading information various companies. The second contains information about the headquarters and numbers of employees for a particular company. If the two DataFrames have one field in common—such as a stock symbol or company name—you can combine the two DataFrames so that each row contains both the stock trading data and the company background information.

Notice that in this horizontal combination, we aren’t adding any additional rows. Rather, we’re adding columns to existing rows. The horizontal combination from a merge operation is similar to a JOIN operator in SQL.

Now that we understand the difference between vertical combinations with concat and horizontal combinations with merge or join, let’s take a deeper look at how to use these methods.

How to use Pandas’ `concat` method

In our previous post on Pandas DataFrames, we used sample stock data to show create, explore, and manipulate DataFrames. We used sample stock trading data from Amazon (“AMZN”), Apple (“AAPL”) and Google (“GOOG”).

Imagine we had a second data set with trading information on two additional companies, Facebook (“FB”) and Tesla (“TSLA”). These DataFrames have the same shape, so it would be useful to combine them to operate on them together.

We can do this with the concat method as follows:

>>> import pandas as pd

>>> stocks = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv')

>>> stocks.info()

< class 'pandas.core.frame.DataFrame'>

RangeIndex: 15 entries, 0 to 14

Data columns (total 7 columns):

date      15 non-null object

symbol    15 non-null object

open      15 non-null float64

high      15 non-null float64

low       15 non-null float64

close     15 non-null float64

volume    15 non-null int64

dtypes: float64(4), int64(1), object(2)

memory usage: 920.0+ bytes

>>> stocks2 = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/53ebac036b404875ef8e981c0cbd0901/raw/6c70336130eb7e45cec167ee7cd52d15baa392ea/stocks2.csv')

>>> combined = pd.concat([stocks, stock2], ignore_index=True)

>>> combined

          date symbol     open       high      low    close    volume

0   2019-03-01   AMZN  1655.13  1674.2600  1651.00  1671.73   4974877

1   2019-03-04   AMZN  1685.00  1709.4300  1674.36  1696.17   6167358

2   2019-03-05   AMZN  1702.95  1707.8000  1689.01  1692.43   3681522

3   2019-03-06   AMZN  1695.97  1697.7500  1668.28  1668.95   3996001

4   2019-03-07   AMZN  1667.37  1669.7500  1620.51  1625.95   4957017

5   2019-03-01   AAPL   174.28   175.1500   172.89   174.97  25886167

6   2019-03-04   AAPL   175.69   177.7500   173.97   175.85  27436203

7   2019-03-05   AAPL   175.94   176.0000   174.54   175.53  19737419

8   2019-03-06   AAPL   174.67   175.4900   173.94   174.52  20810384

9   2019-03-07   AAPL   173.87   174.4400   172.02   172.50  24796374

10  2019-03-01   GOOG  1124.90  1142.9700  1124.75  1140.99   1450316

11  2019-03-04   GOOG  1146.99  1158.2800  1130.69  1147.80   1446047

12  2019-03-05   GOOG  1150.06  1169.6100  1146.19  1162.03   1443174

13  2019-03-06   GOOG  1162.49  1167.5700  1155.49  1157.86   1099289

14  2019-03-07   GOOG  1155.72  1156.7600  1134.91  1143.30   1166559

15  2019-03-01     FB   162.60   163.1320   161.69   162.28  11097770

16  2019-03-04     FB   163.90   167.5000   163.83   167.37  18894689

17  2019-03-05     FB   167.37   171.8800   166.55   171.26  28187890

18  2019-03-06     FB   172.90   173.5700   171.27   172.51  21531723

19  2019-03-07     FB   171.50   171.7400   167.61   169.13  18306504

20  2019-03-01   TSLA   306.94   307.1300   291.90   294.79  22911375

21  2019-03-04   TSLA   298.12   299.0000   282.78   285.36  17096818

22  2019-03-05   TSLA   282.00   284.0000   270.10   276.54  18764740

23  2019-03-06   TSLA   276.48   281.5058   274.39   276.24  10335485

24  2019-03-07   TSLA   278.84   284.7000   274.25   276.59   9442483

In the example above, we create our first DataFrame and use the DataFrame’s info method to see that there are 15 rows and seven columns in our DataFrame. Then, we create our second DataFrame and combine it with our first DataFrame using the concat method. After this, we can see our DataFrame includes information on all five companies in across our two DataFrames.

If we use the info method to look at our new DataFrame, we can see that there are still seven columns but now there are 25 rows—the combination of the two DataFrames.

>>> combined.info()

< class 'pandas.core.frame.DataFrame'>

Int64Index: 25 entries, 0 to 9

Data columns (total 7 columns):

date      25 non-null object

symbol    25 non-null object

open      25 non-null float64

high      25 non-null float64

low       25 non-null float64

close     25 non-null float64

volume    25 non-null int64

dtypes: float64(4), int64(1), object(2)

memory usage: 1.6+ KB

The concat method is a great way to combine multiple DataFrames that contain similar data shapes. Note that the columns don’t need to match entirely—Pandas will simply include a null value for columns without values.

Want to Code Faster?

Kite is a plugin for PyCharm, Atom, Vim, VSCode, Sublime Text, and IntelliJ that uses machine learning to provide you with code completions in real time sorted by relevance. Start coding faster today.

Download Kite Free

Using Pandas’ `merge` and `join` to combine DataFrames

The merge and join methods are a pair of methods to horizontally combine DataFrames with Pandas. This is a great way to enrich with DataFrame with the data from another DataFrame.

Both merge and join are operating in similar ways, but the join method is a convenience method to make it easier to combine DataFrames. In fact, join is using merge under the hood. I prefer to use join where possible as it’s slightly easier syntax.

When using either merge or join, you’ll need to specify how the DataFrames should be merged or joined. There are four possible values for how to join two DataFrames:

Left: Retain all rows for the first DataFrame and enrich with columns from the second DataFrame where they match on the columns on which to join;
Right: Same as left but reversed—retain all rows for the second DataFrame and enrich with columns from matches in the first DataFrame.
Inner: Retain only the intersection of the two DataFrames—rows in which there are values in both DataFrames for the columns on which the join is performed.
Outer: Retain all rows from both DataFrames regardless of whether there are matching rows in the other DataFrame.

These four join types match up with the four types of SQL joins. Note that the default value for how in the merge method is inner, while the default value for how in the join method is left. In general, it’s best to explicitly specify the type of join you want for clarity.

To understand how the different types of joins work, let’s walk through some examples.

Using a left join with Pandas `join` method

First, let’s see how to use the left join type. We will continue to use our combined stock data from the concat section above. Imagine we want to enrich that data with background information about the company, such as how many employees they have as well as where the company’s headquarters is located.

Let’s load our company background information, then join it with our trading data using a left join.

>>> companies = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/90954a10bbdebd89185f7b4d340563cd/raw/d91744e599356b7a52b9c027751d4613c06c5f40/companies.csv')

>>> companies

  symbol  employees headquarters_city headquarters_state

0   AMZN     613300           Seattle                 WA

1   GOOG      98771     Mountain View                 CA

2   AAPL     132000         Cupertino                 CA

>>> enriched = combined.join(companies.set_index('symbol'), on='symbol', how='left')

>>> enriched

          date symbol     open       high      low    close    volume  employees headquarters_city headquarters_state

0   2019-03-01   AMZN  1655.13  1674.2600  1651.00  1671.73   4974877   613300.0           Seattle                 WA

1   2019-03-04   AMZN  1685.00  1709.4300  1674.36  1696.17   6167358   613300.0           Seattle                 WA

2   2019-03-05   AMZN  1702.95  1707.8000  1689.01  1692.43   3681522   613300.0           Seattle                 WA

3   2019-03-06   AMZN  1695.97  1697.7500  1668.28  1668.95   3996001   613300.0           Seattle                 WA

4   2019-03-07   AMZN  1667.37  1669.7500  1620.51  1625.95   4957017   613300.0           Seattle                 WA

5   2019-03-01   AAPL   174.28   175.1500   172.89   174.97  25886167   132000.0         Cupertino                 CA

6   2019-03-04   AAPL   175.69   177.7500   173.97   175.85  27436203   132000.0         Cupertino                 CA

7   2019-03-05   AAPL   175.94   176.0000   174.54   175.53  19737419   132000.0         Cupertino                 CA

8   2019-03-06   AAPL   174.67   175.4900   173.94   174.52  20810384   132000.0         Cupertino                 CA

9   2019-03-07   AAPL   173.87   174.4400   172.02   172.50  24796374   132000.0         Cupertino                 CA

10  2019-03-01   GOOG  1124.90  1142.9700  1124.75  1140.99   1450316    98771.0     Mountain View                 CA

11  2019-03-04   GOOG  1146.99  1158.2800  1130.69  1147.80   1446047    98771.0     Mountain View                 CA

12  2019-03-05   GOOG  1150.06  1169.6100  1146.19  1162.03   1443174    98771.0     Mountain View                 CA

13  2019-03-06   GOOG  1162.49  1167.5700  1155.49  1157.86   1099289    98771.0     Mountain View                 CA

14  2019-03-07   GOOG  1155.72  1156.7600  1134.91  1143.30   1166559    98771.0     Mountain View                 CA

15  2019-03-01     FB   162.60   163.1320   161.69   162.28  11097770        NaN               NaN                NaN

16  2019-03-04     FB   163.90   167.5000   163.83   167.37  18894689        NaN               NaN                NaN

17  2019-03-05     FB   167.37   171.8800   166.55   171.26  28187890        NaN               NaN                NaN

18  2019-03-06     FB   172.90   173.5700   171.27   172.51  21531723        NaN               NaN                NaN

19  2019-03-07     FB   171.50   171.7400   167.61   169.13  18306504        NaN               NaN                NaN

20  2019-03-01   TSLA   306.94   307.1300   291.90   294.79  22911375        NaN               NaN                NaN

21  2019-03-04   TSLA   298.12   299.0000   282.78   285.36  17096818        NaN               NaN                NaN

22  2019-03-05   TSLA   282.00   284.0000   270.10   276.54  18764740        NaN               NaN                NaN

23  2019-03-06   TSLA   276.48   281.5058   274.39   276.24  10335485        NaN               NaN                NaN

24  2019-03-07   TSLA   278.84   284.7000   274.25   276.59   9442483        NaN               NaN                NaN

Let’s walk through what’s happening here.

First, we load our companies DataFrame from a URL. When we look at that DataFrame, we can see it has data for three of our companies—AMZN, AAPL, and GOOG—but not for two others—FB and TSLA.

We then use the join() method to combine our two DataFrames. We do this using a left join and specify that the join should occur on the symbol column, which is common to both DataFrames.

After our join is complete, we print out the DataFrame. You can see that there are the 25 rows from the original DataFrame and that some of those rows have been enriched with information about company headquarters and number of employees. However, it retained those rows that didn’t have matching information in the second DataFrame.

Using an inner join with Pandas `join` method

Now let’s see how the same two DataFrames would be joined when using an inner join.

>>> enriched_inner = combined.join(companies.set_index('symbol'), on='symbol', how='inner')

>>> enriched_inner

          date symbol     open     high      low    close    volume  employees headquarters_city headquarters_state

0   2019-03-01   AMZN  1655.13  1674.26  1651.00  1671.73   4974877     613300           Seattle                 WA

1   2019-03-04   AMZN  1685.00  1709.43  1674.36  1696.17   6167358     613300           Seattle                 WA

2   2019-03-05   AMZN  1702.95  1707.80  1689.01  1692.43   3681522     613300           Seattle                 WA

3   2019-03-06   AMZN  1695.97  1697.75  1668.28  1668.95   3996001     613300           Seattle                 WA

4   2019-03-07   AMZN  1667.37  1669.75  1620.51  1625.95   4957017     613300           Seattle                 WA

5   2019-03-01   AAPL   174.28   175.15   172.89   174.97  25886167     132000         Cupertino                 CA

6   2019-03-04   AAPL   175.69   177.75   173.97   175.85  27436203     132000         Cupertino                 CA

7   2019-03-05   AAPL   175.94   176.00   174.54   175.53  19737419     132000         Cupertino                 CA

8   2019-03-06   AAPL   174.67   175.49   173.94   174.52  20810384     132000         Cupertino                 CA

9   2019-03-07   AAPL   173.87   174.44   172.02   172.50  24796374     132000         Cupertino                 CA

10  2019-03-01   GOOG  1124.90  1142.97  1124.75  1140.99   1450316      98771     Mountain View                 CA

11  2019-03-04   GOOG  1146.99  1158.28  1130.69  1147.80   1446047      98771     Mountain View                 CA

12  2019-03-05   GOOG  1150.06  1169.61  1146.19  1162.03   1443174      98771     Mountain View                 CA

13  2019-03-06   GOOG  1162.49  1167.57  1155.49  1157.86   1099289      98771     Mountain View                 CA

14  2019-03-07   GOOG  1155.72  1156.76  1134.91  1143.30   1166559      98771     Mountain View                 CA

Our join code is the exact same as the previous section except that we changed from a left to an inner join. When we print out the DataFrame, you can see the difference—we only have fifteen lines in our combined DataFrame. The 10 lines of FB and TSLA that didn’t have a match in our companies DataFrame were dropped entirely rather than including null values for the missing columns.

The type of join you want to use will depend on your use case. In these two examples, we learned how the type of join affects which rows are retained in a join operation.

Want to Code Faster?

Download Kite Free

Conclusion

In this post, we learned how to combine DataFrames in Pandas. We learned that there are two types of ways to combine DataFrames—horizontally or vertically. We then walked through examples of combining vertically with concat() and horizontally with join() or merge(). Finally, we learned about the different types of joins with using join() or merge().

Remembering the different ways to combine DataFrames or to perform joins can be tricky, but Kite has you covered. Kite is an autocomplete for Python to speed up your workflow while coding. Further, it helps you look up Python docs without going to Google. Kite even provides examples of how other people have used similar libraries and methods, allowing you to take advantage of the expertise of others. Check it out today!

View Article Comments

Pandas Merge, Join, and Concat: How To and Examples

Table of Contents

When to use the Pandas `concat` vs. `merge` and `join`

How to use Pandas’ `concat` method

Using Pandas’ `merge` and `join` to combine DataFrames

Using a left join with Pandas `join` method

Using an inner join with Pandas `join` method

Conclusion

Company

Product

Resources

Stay in touch

Pandas Merge, Join, and Concat: How To and Examples

Table of Contents

When to use the Pandas concat vs. merge and join

How to use Pandas’ concat method

Using Pandas’ merge and join to combine DataFrames

Using a left join with Pandas join method

Using an inner join with Pandas join method

Conclusion

Company

Product

Resources

Stay in touch

When to use the Pandas `concat` vs. `merge` and `join`

How to use Pandas’ `concat` method

Using Pandas’ `merge` and `join` to combine DataFrames

Using a left join with Pandas `join` method

Using an inner join with Pandas `join` method