
Cold Email A/B Testing: What to Test and How to Measure Results

Most people run cold email campaigns, get mediocre results, and move on.

At imisofts, we run 100+ A/B tests monthly. Each test teaches us something. Compound those lessons, and you get 65-75% open rates and 3-5% reply rates.

This is our A/B testing framework.

The Cold Email A/B Testing Hierarchy

Not all tests have equal impact. We prioritize:

  1. Subject line (highest impact)
  2. Opening line (second highest)
  3. Value statement (medium impact)
  4. CTA (medium-low impact)
  5. Send time (lowest impact)

Never test low-impact variables first. You'll waste samples before finding the real wins.

Testing Subject Lines (Highest Impact)

Subject lines determine open rate. Open rate determines everything else.

Test structure:

Variant A (baseline): [Current subject line that gets 45% opens]

Variant B (test): [New subject line with different formula]

Sample size: 50-100 prospects each

Duration: 3-5 days
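If your sending tool doesn't randomize the split for you, it's worth doing by hand: shuffle the list before dividing it so neither variant inherits a bias from the list's ordering (by company size, scrape date, etc.). A minimal Python sketch, with an illustrative prospect list:

```python
import random

def split_ab(prospects, seed=42):
    """Randomly split a prospect list into two equal-size variants.

    Shuffling first avoids bias from list ordering (e.g., prospects
    sorted by company size or scrape date).
    """
    shuffled = prospects[:]                # copy; leave the original untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

# Illustrative list; in practice this comes from your prospect database.
prospects = [f"prospect_{i}@example.com" for i in range(200)]
variant_a, variant_b = split_ab(prospects)
print(len(variant_a), len(variant_b))  # 100 100
```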

Example test:

Variant A: "hi john, noticed you launched [product]" (45% open rate)

Variant B: "quick thought on [industry]" (52% open rate)

Winner: Variant B (+7 percentage points)

This 7-point improvement compounds. Across 10,000 prospects, that's 700 additional opens. At a 1-5% reply rate, 700 additional opens mean 7-35 additional replies.

That's 7-35 additional conversations from changing a single subject line.
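Here's that back-of-the-envelope math as a few lines of Python, using the numbers from the example above:

```python
prospects = 10_000
open_lift = 0.07                     # +7 percentage points from the winning subject line
extra_opens = prospects * open_lift  # 700 additional opens

# Replies come only from opens; assume a 1-5% reply rate among openers.
low, high = extra_opens * 0.01, extra_opens * 0.05
print(f"{extra_opens:.0f} extra opens -> {low:.0f}-{high:.0f} extra replies")
# 700 extra opens -> 7-35 extra replies
```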

Subject line tests we run:

  • Personalization (first name + achievement vs. generic)
  • Question format vs. statement format
  • Lowercase vs. Title Case
  • Short and generic vs. longer and specific
  • Different achievement angles

Testing Opening Lines (Second Highest Impact)

The opening line determines whether prospects read past the first sentence.

Test structure:

Email 1, Variant A: [Opening line A] + [rest of email unchanged]

Email 1, Variant B: [Opening line B] + [rest of email unchanged]

Sample size: 100 prospects each

Duration: 3-5 days

Metric: read-through rate, i.e., the share of openers who actually read the email (hard to track without advanced tools)

Alternative: Track reply rate as proxy for "engagement with content"

Example test:

Variant A: "I noticed you launched [product] last month." (1.2% reply rate)

Variant B: "Most SaaS founders spend 15 hours/week on prospecting. You probably do too." (1.8% reply rate)

Winner: Variant B (+0.6 percentage points in reply rate)

0.6 points seems small. Across 10,000 prospects, it's 60 additional replies.

Opening line tests we run:

  • Specific personalization vs. general observation
  • Problem-first vs. achievement-first
  • Curiosity gap vs. direct statement
  • Industry pattern vs. company-specific observation

Testing Value Statements (Medium Impact)

The value statement is your chance to prove relevance before the pitch.

Test structure:

Email 1, Variant A: [Value statement A]

Email 1, Variant B: [Value statement B]

Metric: Email 1 reply rate or Email 2 open rate (if they engage with Email 1, they'll engage with Email 2)

Example test:

Variant A: "We help SaaS teams automate their prospecting and save 12 hours/week." (2% Email 1 reply)

Variant B: "One of your competitors just booked 8 qualified deals this month using [tactic]." (2.8% Email 1 reply)

Winner: Variant B (social proof outperforms direct benefit)

This changes your entire Email 1 strategy. Across our campaigns, social proof hooks outperform benefit hooks by 20-40%.

Value statement tests we run:

  • Direct benefit vs. social proof
  • Specific metric vs. general statement
  • Industry pattern vs. company-specific observation
  • Problem-agitation vs. opportunity-excitement

Testing CTAs (Medium-Low Impact)

CTA wording has lower impact than subject/opening, but still matters.

Test structure:

Email 2, Variant A: "[CTA A] or reply with your timeline."

Email 2, Variant B: "[CTA B] or reply with your timeline."

Metric: Email 2 reply rate

Example test (SaaS):

Variant A: "Book a 15-min strategy call" (3.2% reply)

Variant B: "Are you open to a quick conversation?" (2.8% reply)

Winner: Variant A (direct booking link outperforms vague ask)

But this varies by industry. Medicare-focused campaigns might see the opposite result (phone-first CTAs outperform calendar links).

CTA tests we run:

  • Direct booking link vs. "reply to schedule"
  • Soft ask vs. hard ask
  • Specific time ("15 min") vs. vague ("quick call")
  • Phone number vs. calendar link (varies by industry)

Testing Send Times (Lowest Impact)

When you send affects open rate, but much less than what you send.

Test structure:

Group A: Send on Tuesday, 10 AM

Group B: Send on Thursday, 10 AM

Metric: Open rate

Our data across 50M+ emails:

Monday: 38% open rate

Tuesday: 45% open rate

Wednesday: 47% open rate

Thursday: 48% open rate

Friday: 40% open rate

Best day: Thursday, 10 AM

Worst day: Monday, 10 AM

Difference: ~10 percentage points

That matters, but nowhere near as much as subject line testing (which can change open rate by 30+ points).

Send time tests we run:

  • Weekday vs. weekend
  • Morning vs. afternoon vs. evening
  • Time zone-specific sends
  • Industry-specific patterns (e.g., healthcare sees higher opens on Friday due to weekly planning)

How to Measure Statistical Significance

You don't need a PhD in statistics. Here's the simple rule:

Use a sample of 50+ per variant. If you see a difference of 5+ percentage points, it's probably real.

More rigorous approach:

Use a two-proportion z-test calculator (most online A/B-test significance calculators implement one).

Example:

  • Variant A: 45 opens out of 100 (45%)
  • Variant B: 52 opens out of 100 (52%)
  • Difference: 7 percentage points

Question: Is this real or random?

Plug the numbers into the calculator. If the p-value is < 0.05, the result is statistically significant (95% confidence) and you can trust it.
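If you'd rather script it than use an online calculator, here's a self-contained sketch of the standard pooled two-proportion z-test; it needs nothing beyond Python's math module:

```python
from math import erf, sqrt

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two proportions,
    using the pooled two-proportion z-test."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Convert |z| to a two-sided p-value via the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# The example above: 45/100 opens vs. 52/100 opens.
print(round(two_proportion_p_value(45, 100, 52, 100), 2))  # ~0.32
```

Note what this says about the example: with 100 prospects per variant, a 7-point gap gives p ≈ 0.32, well above 0.05. That's the gap between "probably real" in the rules of thumb below and "proven at 95% confidence": treat it as a promising signal, keep sending, and let the sample grow before locking the winner in.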

For cold email, we use this rule of thumb:

Sample < 50: Don't trust the result. Run more samples.

Sample 50-100: If the difference is > 5 points, it's probably real.

Sample 100-200: If the difference is > 3 points, it's probably real.

Sample 200+: If the difference is > 2 points, it's probably real.
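Codified, the rule of thumb is a small lookup. A sketch (the thresholds are the ones from the list above; heuristics, not statistical guarantees):

```python
def probably_real(sample_per_variant, diff_points):
    """Apply the rule of thumb: smaller samples need bigger
    differences before the result is worth trusting."""
    if sample_per_variant < 50:
        return False                 # don't trust it; run more samples
    if sample_per_variant <= 100:
        return diff_points > 5
    if sample_per_variant <= 200:
        return diff_points > 3
    return diff_points > 2

print(probably_real(100, 7))   # True  (7-point gap on 100/variant)
print(probably_real(40, 12))   # False (sample too small to trust)
```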

The Weekly Testing Cycle

Monday: Review last week's tests. Declare winners.

Tuesday-Wednesday: Roll out winning variant to 50% of new prospects.

Wednesday-Thursday: Run new tests on remaining 50%.

Friday: Measure results.

Monday: Repeat.

This weekly cycle compounds. Each week you find one new winning variant. Month 1, you're at baseline. Month 3, you're 30-40% above baseline.

What Not to Test

Don't test too many things at once

Wrong: Test subject line, opening line, CTA, send time simultaneously.

You won't know which variable drove the change. Testing everything at once is multivariate testing, and it requires huge sample sizes.

Right: Test one variable per week.

Subject line week 1. Opening line week 2. CTA week 3. You learn faster and with smaller samples.

Don't test on tiny samples

Wrong: Test on 10 people per variant.

Too much variance; random chance plays a huge role.

Right: Test on 50+ people per variant minimum.

This gives signal above noise.

Don't declare winners too early

Wrong: Run test for 24 hours. Declare winner.

Time of day matters. Day of week matters. One day isn't enough.

Right: Run test for 5-7 days minimum.

This accounts for daily/weekly patterns.

Testing Template: What We Track

| Element | Variant A | Variant B | Winner | Notes |
|---------|-----------|-----------|--------|-------|
| Subject line | hi john, noticed [product] | quick thought on [industry] | B | +7 points open rate |
| Opening line | I noticed... | Most [industry]... | B | Higher engagement |
| Value statement | Direct benefit | Social proof | B | Social proof +0.8 points reply |
| CTA | Book call | Reply to schedule | A | Direct link better |
| Send time | Tuesday 10 AM | Thursday 10 AM | B | Thursday +3 points open |

Tools for A/B Testing

At imisofts, we use:

  • Instantly (built-in A/B testing)
  • SmartLead (rotation + analytics)
  • Clay + Apollo (data merge + manual testing)
  • Custom scripts (for complex multivariate tests)

Most platforms now offer native A/B testing. Use it.

Results: What Testing Gets You

Baseline campaign (no testing):

  • Open rate: 35%
  • Reply rate: 1.5%

After 3 months of weekly testing:

  • Open rate: 55% (+20 points)
  • Opening line: better engagement
  • CTA: better conversion
  • Reply rate: 3.5% (+2 points)

That 2-point improvement in reply rate is massive: it more than doubles your results.

What We Recommend at imisofts

We run A/B testing for all managed clients:

  • Weekly testing cycles
  • Subject line, opening, CTA, send time
  • Multivariate testing for scaled campaigns
  • Statistical significance validation
  • Monthly optimization reports

Packages start at $497/month (Management with testing) to $2,450/year (Enterprise with full testing suite).

Explore imisofts Cold Email Packages

Frequently Asked Questions

How many prospects do I need per variant?

Minimum 50 per variant. With 50-person samples, a 5-point difference is meaningful. Larger samples (100+) let you detect smaller differences (2-3 points).

How long should a test run?

Minimum 5-7 days. This accounts for daily and weekly variations in open rates and behavior. 24-hour tests are unreliable.

What should I test first?

Subject line (highest impact), then opening line, then value statement, then CTA, then send time. Don't test low-impact variables first.

Can I test multiple variables at once?

Not recommended. Test one variable per week. Testing multiple variables makes it impossible to know which one caused the change.

How do I measure statistical significance?

Use a two-proportion z-test calculator. If the p-value is < 0.05, your result is statistically significant (95% confidence). Rule of thumb for cold email: a 5+ point difference on 50 samples per variant is likely real.

Ready to build your cold email infrastructure?

See our packages and get started with a system built for deliverability.

View Our Packages