Pandas Merge, Join, and Concat: How To and Examples
Table of Contents
- When to use the Pandas concat vs. merge and join
- How to use Pandas’ concat method
- Using Pandas’ merge and join to combine DataFrames
- Conclusion
Python’s Pandas library is a popular library for cleaning, manipulating, and interpreting large amounts of data. In an earlier post, we looked at the foundational structure in Pandas—the DataFrame. If you’re unfamiliar with Pandas DataFrames, take a look at that post to understand the basics.
In this post, we’ll learn how to combine multiple DataFrames using Pandas merge
, join
, and concat
. These methods let you supercharge your data by gluing together data from different sources.
We’ll cover:
- When to use
concat
and when to usemerge
orjoin
; - How to use Pandas’
concat
method; - How to use Pandas’
merge
andjoin
.
Let’s get started.
When to use the Pandas concat
vs. merge
and join
While merge
, join
, and concat
all work to combine multiple DataFrames, they are used for very different things. In this section, we’ll learn when you will want to use one operation over another. The key distinction is whether you want to combine your DataFrames horizontally or vertically.
The concat
method allows you to combine DataFrames vertically.
Imagine you had two DataFrames with the same columns. Perhaps the first DataFrame includes 10 rows of stock trading data for one stock while the second DataFrame includes 10 rows of stock trading data for a different stock. A vertical combination would use a DataFrame’s concat
method to combine the two DataFrames into a single DataFrame with twenty rows.
Notice that in a vertical combination with concat
, the number of rows has increased but the number of columns has stayed the same.
By contrast, the merge
and join
methods help to combine DataFrames horizontally.
Imagine you have two DataFrames. The first contains stock trading information various companies. The second contains information about the headquarters and numbers of employees for a particular company. If the two DataFrames have one field in common—such as a stock symbol or company name—you can combine the two DataFrames so that each row contains both the stock trading data and the company background information.
Notice that in this horizontal combination, we aren’t adding any additional rows. Rather, we’re adding columns to existing rows. The horizontal combination from a merge
operation is similar to a JOIN
operator in SQL.
Now that we understand the difference between vertical combinations with concat
and horizontal combinations with merge
or join
, let’s take a deeper look at how to use these methods.
How to use Pandas’ concat
method
In our previous post on Pandas DataFrames, we used sample stock data to show create, explore, and manipulate DataFrames. We used sample stock trading data from Amazon (“AMZN”), Apple (“AAPL”) and Google (“GOOG”).
Imagine we had a second data set with trading information on two additional companies, Facebook (“FB”) and Tesla (“TSLA”). These DataFrames have the same shape, so it would be useful to combine them to operate on them together.
We can do this with the concat
method as follows:
>>> import pandas as pd
>>> stocks = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv')
>>> stocks.info()
< class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 7 columns):
date 15 non-null object
symbol 15 non-null object
open 15 non-null float64
high 15 non-null float64
low 15 non-null float64
close 15 non-null float64
volume 15 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 920.0+ bytes
>>> stocks2 = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/53ebac036b404875ef8e981c0cbd0901/raw/6c70336130eb7e45cec167ee7cd52d15baa392ea/stocks2.csv')
>>> combined = pd.concat([stocks, stock2], ignore_index=True)
>>> combined
date symbol open high low close volume
0 2019-03-01 AMZN 1655.13 1674.2600 1651.00 1671.73 4974877
1 2019-03-04 AMZN 1685.00 1709.4300 1674.36 1696.17 6167358
2 2019-03-05 AMZN 1702.95 1707.8000 1689.01 1692.43 3681522
3 2019-03-06 AMZN 1695.97 1697.7500 1668.28 1668.95 3996001
4 2019-03-07 AMZN 1667.37 1669.7500 1620.51 1625.95 4957017
5 2019-03-01 AAPL 174.28 175.1500 172.89 174.97 25886167
6 2019-03-04 AAPL 175.69 177.7500 173.97 175.85 27436203
7 2019-03-05 AAPL 175.94 176.0000 174.54 175.53 19737419
8 2019-03-06 AAPL 174.67 175.4900 173.94 174.52 20810384
9 2019-03-07 AAPL 173.87 174.4400 172.02 172.50 24796374
10 2019-03-01 GOOG 1124.90 1142.9700 1124.75 1140.99 1450316
11 2019-03-04 GOOG 1146.99 1158.2800 1130.69 1147.80 1446047
12 2019-03-05 GOOG 1150.06 1169.6100 1146.19 1162.03 1443174
13 2019-03-06 GOOG 1162.49 1167.5700 1155.49 1157.86 1099289
14 2019-03-07 GOOG 1155.72 1156.7600 1134.91 1143.30 1166559
15 2019-03-01 FB 162.60 163.1320 161.69 162.28 11097770
16 2019-03-04 FB 163.90 167.5000 163.83 167.37 18894689
17 2019-03-05 FB 167.37 171.8800 166.55 171.26 28187890
18 2019-03-06 FB 172.90 173.5700 171.27 172.51 21531723
19 2019-03-07 FB 171.50 171.7400 167.61 169.13 18306504
20 2019-03-01 TSLA 306.94 307.1300 291.90 294.79 22911375
21 2019-03-04 TSLA 298.12 299.0000 282.78 285.36 17096818
22 2019-03-05 TSLA 282.00 284.0000 270.10 276.54 18764740
23 2019-03-06 TSLA 276.48 281.5058 274.39 276.24 10335485
24 2019-03-07 TSLA 278.84 284.7000 274.25 276.59 9442483
In the example above, we create our first DataFrame and use the DataFrame’s info method to see that there are 15 rows and seven columns in our DataFrame. Then, we create our second DataFrame and combine it with our first DataFrame using the concat
method. After this, we can see our DataFrame includes information on all five companies in across our two DataFrames.
If we use the info method to look at our new DataFrame, we can see that there are still seven columns but now there are 25 rows—the combination of the two DataFrames.
>>> combined.info()
< class 'pandas.core.frame.DataFrame'>
Int64Index: 25 entries, 0 to 9
Data columns (total 7 columns):
date 25 non-null object
symbol 25 non-null object
open 25 non-null float64
high 25 non-null float64
low 25 non-null float64
close 25 non-null float64
volume 25 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 1.6+ KB
The concat
method is a great way to combine multiple DataFrames that contain similar data shapes. Note that the columns don’t need to match entirely—Pandas will simply include a null value for columns without values.
Using Pandas’ merge
and join
to combine DataFrames
The merge
and join
methods are a pair of methods to horizontally combine DataFrames with Pandas. This is a great way to enrich with DataFrame with the data from another DataFrame.
Both merge
and join
are operating in similar ways, but the join
method is a convenience method to make it easier to combine DataFrames. In fact, join
is using merge
under the hood. I prefer to use join
where possible as it’s slightly easier syntax.
When using either merge
or join
, you’ll need to specify how the DataFrames should be merged or joined. There are four possible values for how to join two DataFrames:
- Left: Retain all rows for the first DataFrame and enrich with columns from the second DataFrame where they match on the columns on which to join;
- Right: Same as left but reversed—retain all rows for the second DataFrame and enrich with columns from matches in the first DataFrame.
- Inner: Retain only the intersection of the two DataFrames—rows in which there are values in both DataFrames for the columns on which the join is performed.
- Outer: Retain all rows from both DataFrames regardless of whether there are matching rows in the other DataFrame.
These four join types match up with the four types of SQL joins. Note that the default value for how in the merge
method is inner, while the default value for how in the join
method is left. In general, it’s best to explicitly specify the type of join you want for clarity.
To understand how the different types of joins work, let’s walk through some examples.
Using a left join with Pandas join
method
First, let’s see how to use the left join type. We will continue to use our combined stock data from the concat
section above. Imagine we want to enrich that data with background information about the company, such as how many employees they have as well as where the company’s headquarters is located.
Let’s load our company background information, then join it with our trading data using a left join.
>>> companies = pd.read_csv('https://gist.githubusercontent.com/alexdebrie/90954a10bbdebd89185f7b4d340563cd/raw/d91744e599356b7a52b9c027751d4613c06c5f40/companies.csv')
>>> companies
symbol employees headquarters_city headquarters_state
0 AMZN 613300 Seattle WA
1 GOOG 98771 Mountain View CA
2 AAPL 132000 Cupertino CA
>>> enriched = combined.join(companies.set_index('symbol'), on='symbol', how='left')
>>> enriched
date symbol open high low close volume employees headquarters_city headquarters_state
0 2019-03-01 AMZN 1655.13 1674.2600 1651.00 1671.73 4974877 613300.0 Seattle WA
1 2019-03-04 AMZN 1685.00 1709.4300 1674.36 1696.17 6167358 613300.0 Seattle WA
2 2019-03-05 AMZN 1702.95 1707.8000 1689.01 1692.43 3681522 613300.0 Seattle WA
3 2019-03-06 AMZN 1695.97 1697.7500 1668.28 1668.95 3996001 613300.0 Seattle WA
4 2019-03-07 AMZN 1667.37 1669.7500 1620.51 1625.95 4957017 613300.0 Seattle WA
5 2019-03-01 AAPL 174.28 175.1500 172.89 174.97 25886167 132000.0 Cupertino CA
6 2019-03-04 AAPL 175.69 177.7500 173.97 175.85 27436203 132000.0 Cupertino CA
7 2019-03-05 AAPL 175.94 176.0000 174.54 175.53 19737419 132000.0 Cupertino CA
8 2019-03-06 AAPL 174.67 175.4900 173.94 174.52 20810384 132000.0 Cupertino CA
9 2019-03-07 AAPL 173.87 174.4400 172.02 172.50 24796374 132000.0 Cupertino CA
10 2019-03-01 GOOG 1124.90 1142.9700 1124.75 1140.99 1450316 98771.0 Mountain View CA
11 2019-03-04 GOOG 1146.99 1158.2800 1130.69 1147.80 1446047 98771.0 Mountain View CA
12 2019-03-05 GOOG 1150.06 1169.6100 1146.19 1162.03 1443174 98771.0 Mountain View CA
13 2019-03-06 GOOG 1162.49 1167.5700 1155.49 1157.86 1099289 98771.0 Mountain View CA
14 2019-03-07 GOOG 1155.72 1156.7600 1134.91 1143.30 1166559 98771.0 Mountain View CA
15 2019-03-01 FB 162.60 163.1320 161.69 162.28 11097770 NaN NaN NaN
16 2019-03-04 FB 163.90 167.5000 163.83 167.37 18894689 NaN NaN NaN
17 2019-03-05 FB 167.37 171.8800 166.55 171.26 28187890 NaN NaN NaN
18 2019-03-06 FB 172.90 173.5700 171.27 172.51 21531723 NaN NaN NaN
19 2019-03-07 FB 171.50 171.7400 167.61 169.13 18306504 NaN NaN NaN
20 2019-03-01 TSLA 306.94 307.1300 291.90 294.79 22911375 NaN NaN NaN
21 2019-03-04 TSLA 298.12 299.0000 282.78 285.36 17096818 NaN NaN NaN
22 2019-03-05 TSLA 282.00 284.0000 270.10 276.54 18764740 NaN NaN NaN
23 2019-03-06 TSLA 276.48 281.5058 274.39 276.24 10335485 NaN NaN NaN
24 2019-03-07 TSLA 278.84 284.7000 274.25 276.59 9442483 NaN NaN NaN
Let’s walk through what’s happening here.
First, we load our companies DataFrame from a URL. When we look at that DataFrame, we can see it has data for three of our companies—AMZN, AAPL, and GOOG—but not for two others—FB and TSLA.
We then use the join()
method to combine our two DataFrames. We do this using a left join and specify that the join should occur on the symbol
column, which is common to both DataFrames.
After our join is complete, we print out the DataFrame. You can see that there are the 25 rows from the original DataFrame and that some of those rows have been enriched with information about company headquarters and number of employees. However, it retained those rows that didn’t have matching information in the second DataFrame.
Using an inner join with Pandas join
method
Now let’s see how the same two DataFrames would be joined when using an inner join.
>>> enriched_inner = combined.join(companies.set_index('symbol'), on='symbol', how='inner')
>>> enriched_inner
date symbol open high low close volume employees headquarters_city headquarters_state
0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877 613300 Seattle WA
1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358 613300 Seattle WA
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522 613300 Seattle WA
3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001 613300 Seattle WA
4 2019-03-07 AMZN 1667.37 1669.75 1620.51 1625.95 4957017 613300 Seattle WA
5 2019-03-01 AAPL 174.28 175.15 172.89 174.97 25886167 132000 Cupertino CA
6 2019-03-04 AAPL 175.69 177.75 173.97 175.85 27436203 132000 Cupertino CA
7 2019-03-05 AAPL 175.94 176.00 174.54 175.53 19737419 132000 Cupertino CA
8 2019-03-06 AAPL 174.67 175.49 173.94 174.52 20810384 132000 Cupertino CA
9 2019-03-07 AAPL 173.87 174.44 172.02 172.50 24796374 132000 Cupertino CA
10 2019-03-01 GOOG 1124.90 1142.97 1124.75 1140.99 1450316 98771 Mountain View CA
11 2019-03-04 GOOG 1146.99 1158.28 1130.69 1147.80 1446047 98771 Mountain View CA
12 2019-03-05 GOOG 1150.06 1169.61 1146.19 1162.03 1443174 98771 Mountain View CA
13 2019-03-06 GOOG 1162.49 1167.57 1155.49 1157.86 1099289 98771 Mountain View CA
14 2019-03-07 GOOG 1155.72 1156.76 1134.91 1143.30 1166559 98771 Mountain View CA
Our join
code is the exact same as the previous section except that we changed from a left to an inner join. When we print out the DataFrame, you can see the difference—we only have fifteen lines in our combined DataFrame. The 10 lines of FB and TSLA that didn’t have a match in our companies DataFrame were dropped entirely rather than including null values for the missing columns.
The type of join you want to use will depend on your use case. In these two examples, we learned how the type of join affects which rows are retained in a join
operation.
Conclusion
In this post, we learned how to combine DataFrames in Pandas. We learned that there are two types of ways to combine DataFrames—horizontally or vertically. We then walked through examples of combining vertically with concat()
and horizontally with join()
or merge()
. Finally, we learned about the different types of joins with using join()
or merge()
.
Remembering the different ways to combine DataFrames or to perform joins can be tricky, but Kite has you covered. Kite is an autocomplete for Python to speed up your workflow while coding. Further, it helps you look up Python docs without going to Google. Kite even provides examples of how other people have used similar libraries and methods, allowing you to take advantage of the expertise of others. Check it out today!