2025, Nov 10 03:00

How to Join Two TSV Files by Key with awk: Preserve Tabs, Combine Rows, Control Order

Learn a robust awk approach to join two large TSV files by column keys. Preserve tabs with FS/OFS, store full rows, and output a single merged record per match.

Joining two large TSV files by a specific key looks trivial until the details bite back. A common case: match the first column of one file to the fifth column of another, and emit a single, tab-separated record that combines both lines. The catch is getting awk to both match correctly and preserve tab boundaries without printing the wrong side or collapsing whitespace.

Problem setup

We have two tab-separated files. The first file contains identifiers in column one:

anno1.g20653.t1	anno1.g20674.t1	eud1g02416	eud1g02458	27	+
anno2.g3796.t1	anno1.g20698.t1	eud1g02520	eud1g02556	28	+

The second file contains positional metadata, with the matching identifier in column five:

scaffold_1	transcript	11256	13613	anno1.g20653.t1
scaffold_1	transcript	25598	47989	anno1.g20066.t2

The goal is to output only the merged records where file1 column 1 equals file2 column 5, producing a single tab-separated line per match:

anno1.g20653.t1	anno1.g20674.t1	eud1g02416	eud1g02458	27	+	scaffold_1	transcript	11256	13613	anno1.g20653.t1

What goes wrong in the naïve attempt

The following attempt builds a presence map from the first column of file1 and then tries to select matching rows from file2, but the output contains only rows from file2:

awk '
FNR==NR { keep[$1]; next }
FNR>1 && ($5 in keep)
' File1 File2 > File3

This happens for two reasons. First, the script never saves the line content from file1, only the keys, so when a match is found there is nothing from file1 to print. Second, the program does not set FS and OFS to tabs, so fields are split on arbitrary whitespace and output fields are joined with a single space, which is risky for TSV data. There is also a subtler defect: the FNR>1 guard unconditionally skips the first record of file2, which silently drops a matching row whenever the file has no header, as in the sample data above.
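The whitespace pitfall is easy to reproduce. This minimal sketch (the sample line is illustrative, not from the article's data) shows that with the default FS, an empty TSV field disappears and every later field number shifts:

```shell
# With the default FS, awk treats any run of spaces and tabs as a single
# separator, so the empty second field in "a<TAB><TAB>c" vanishes.
printf 'a\t\tc\n' | awk '{ print NF }'          # prints 2
# With FS set to a literal tab, the empty field is preserved.
printf 'a\t\tc\n' | awk -F'\t' '{ print NF }'   # prints 3
```

The same shift would silently misalign a join on column 5 the moment any earlier column is empty.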

The essence of the issue

To produce a combined record, awk needs the full matching record from file1 alongside the current row from file2. Storing only keys from file1 limits you to filtering rows from file2 without context. Additionally, TSV processing should explicitly set FS and OFS to a tab to preserve empty fields and exact tab boundaries.

Working awk solution

The robust approach is to store full lines from file1 keyed by the join column, then, while reading file2, look up matches and print the concatenation in tab-separated form. This version emits output in the order of file2 and assumes the first column in file1 and the fifth column in file2 are unique:

awk '
BEGIN { FS=OFS="\t" }
NR==FNR { cache[$1] = $0; next }
($5 in cache) { print cache[$5], $0 }
' file1 file2
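As a quick sanity check, the same program can be run against the first sample row from each file, written to temporary files (the mktemp scaffolding and file names here are just for the demo):

```shell
# Recreate one matching row from each sample file and run the join.
dir=$(mktemp -d)
printf 'anno1.g20653.t1\tanno1.g20674.t1\teud1g02416\teud1g02458\t27\t+\n' > "$dir/file1"
printf 'scaffold_1\ttranscript\t11256\t13613\tanno1.g20653.t1\n' > "$dir/file2"
awk '
BEGIN { FS=OFS="\t" }
NR==FNR { cache[$1] = $0; next }
($5 in cache) { print cache[$5], $0 }
' "$dir/file1" "$dir/file2"
# prints the single merged 11-field record shown earlier
```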

If you prefer the output to follow the order of file1, swap the input order and invert the lookup. This version stores file2 lines keyed by its fifth column and then walks file1:

awk '
BEGIN { FS=OFS="\t" }
NR==FNR { cache[$5] = $0; next }
($1 in cache) { print $0, cache[$1] }
' file2 file1

If each input contains a header row that should be skipped, insert a guard to skip the first record of each file:

awk '
BEGIN { FS=OFS="\t" }
FNR==1 { next }
NR==FNR { cache[$1] = $0; next }
($5 in cache) { print cache[$5], $0 }
' file1 file2

These programs explicitly set FS and OFS to tabs, retain entire matching lines from one file in memory, and print a single merged TSV row per match.

Why the details matter

Two assumptions shape behavior and performance. First, the approach depends on unique keys per side of the join: the first column in the first file and the fifth column in the second. If duplicates exist, only the last stored line per key is retained. Second, the script stores one entire file in memory to perform constant-time lookups, which allows fast merging but means memory usage scales with the stored file. Output order is a direct consequence of which file is read second.

Takeaways

When joining TSV data with awk, always set FS and OFS to "\t" to avoid whitespace pitfalls, store full rows from the build side to enable concatenation, and be conscious of uniqueness, memory footprint, and output ordering. With these in place, merging files by column is both concise and reliable.

The article is based on a question from StackOverflow by rseg and an answer by Ed Morton.