2025, Nov 29 19:00

How to correctly extract /images/ URLs from inline JavaScript using Python regex and BeautifulSoup

Learn why /images/* fails and how to reliably extract url values from inline scripts using Python regex and BeautifulSoup. Step-by-step fix with examples.

Extracting structured data from inline scripts can be deceptively tricky when the first instinct is to throw a quick regex at it. Here, the goal is straightforward: pull the path strings that follow url in a script block on a galaxy archive page and reuse them later to request higher-resolution images. The initial approach returns only the literal occurrences of the word images because of a subtle regex mistake. Let’s walk through what’s happening and how to fix it cleanly.

Reproducing the issue

The page is fetched with requests, then parsed with BeautifulSoup. The first script tag is selected, and a regex is applied to extract paths. However, the result is a list of partial matches that don’t contain the desired URLs.

import re
import requests
from bs4 import BeautifulSoup

page_addr = 'https://esahubble.org/images/archive/category/galaxies/page/1/'
resp = requests.get(page_addr)
dom = BeautifulSoup(resp.text, 'html.parser')
# Select the first <script> tag, which holds the inline image data
js_tag = dom.find('script')

# Buggy pattern: '*' repeats the preceding '/', not "anything"
paths = re.findall(r"/images/*", str(js_tag))

print(paths)

What’s going wrong and why

The pattern /images/* doesn’t mean “/images/ followed by anything.” In a regex, the asterisk repeats only the token immediately before it. Here that token is the final forward slash, so the pattern means “/images/ followed by zero or more additional slashes,” which is why every match collapses to the literal /images/ prefix and never captures the rest of the path.
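The difference is easy to see in isolation. The sketch below runs both variants against a made-up one-line snippet shaped like an entry from the script block (the sample string is illustrative, not taken from the live page):

```python
import re

# Illustrative snippet mimicking one entry of the inline script
sample = "url: '/images/heic2018b/',"

# Buggy pattern: '*' repeats only the preceding '/'
print(re.findall(r"/images/*", sample))
# ['/images/']

# Repeating a broader token captures the rest of the path
print(re.findall(r"/images/[^']*", sample))
# ['/images/heic2018b/']
```

The first call stops at the slash because that is the only token the asterisk can repeat; the second repeats a character class instead, so it runs on until the closing quote.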

There are two workable targets in this script block. You can either capture the value that follows url: '...' (which returns the complete relative path), or capture only the first segment after /images/ if that’s all you need. Both are valid, depending on how you plan to build subsequent requests.

Solution: match the url value explicitly

The most direct route is to match the url field and grab the quoted value. This yields a list of strings like /images/heic2018b/ that you can join with the site base path or use as-is.

import re
import requests
from bs4 import BeautifulSoup

page_addr = 'https://esahubble.org/images/archive/category/galaxies/page/1/'
resp = requests.get(page_addr)
dom = BeautifulSoup(resp.text, 'html.parser')
js_tag = dom.find('script')

# Anchor on the url key and capture everything inside the quotes
pattern = r"url\s*:\s*'([^']+)'"
collected = re.findall(pattern, str(js_tag))

print(collected)
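To turn those relative paths into absolute URLs for the follow-up requests, urljoin from the standard library does the joining cleanly. A minimal sketch, assuming the site root from the page address above as the base:

```python
from urllib.parse import urljoin

base = 'https://esahubble.org'

# Example relative path in the shape returned by the regex above
rel = '/images/heic2018b/'

print(urljoin(base, rel))
# https://esahubble.org/images/heic2018b/
```

Because the extracted paths start with a slash, urljoin replaces the base path entirely rather than appending, which is exactly what is wanted here.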

If you only want the first part after /images/ and not the trailing slash or deeper segments, you can switch the pattern to match that slice directly: /images/[^/]+. This anchors on /images/ and stops at the next slash, returning values like /images/heic2018b (though note it will match any occurrence of /images/ in the script, not just the url fields).

import re
import requests
from bs4 import BeautifulSoup

page_addr = 'https://esahubble.org/images/archive/category/galaxies/page/1/'
resp = requests.get(page_addr)
dom = BeautifulSoup(resp.text, 'html.parser')
js_tag = dom.find('script')

# Match /images/ plus one path segment, stopping at the next slash
subset = re.findall(r"/images/[^/]+", str(js_tag))

print(subset)
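This looser pattern matches wherever /images/ appears in the script, including inside the CDN src thumbnails shown later in this article. A quick sketch on an illustrative snippet shaped like those entries shows the difference, and why anchoring on the url key remains the safer choice:

```python
import re

# Illustrative snippet combining a src thumbnail and a url field
snippet = (
    "src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic2018b.jpg', "
    "url: '/images/heic2018b/'"
)

# The loose pattern also matches inside the CDN src URL
print(re.findall(r"/images/[^/]+", snippet))
# ['/images/thumb300y', '/images/heic2018b']

# Anchoring on the url key filters out the false positive
print(re.findall(r"url\s*:\s*'(/images/[^/]+)", snippet))
# ['/images/heic2018b']
```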

Optional: persist results to a file

If you prefer to store the extracted paths on disk for later processing, you can write them line by line. This mirrors the approach of saving all captured url values.

import re

js_blob = r"""
{
    id: 'heic2018b',
    title: 'Galaxy NGC 2525',
    width: 3657,
    height: 3920,
    src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic2018b.jpg',
    url: '/images/heic2018b/',
    potw: ''
},

{
    id: 'potw1345a',
    title: 'Antennae Galaxies reloaded',
    width: 4240,
    height: 4211,
    src: 'https://cdn.esahubble.org/archives/images/thumb300y/potw1345a.jpg',
    url: '/images/potw1345a/',
    potw: '11 November 2013'
},

{
    id: 'heic0817a',
    title: 'Magnetic monster NGC 1275',
    width: 4633,
    height: 3590,
    src: 'https://cdn.esahubble.org/archives/images/thumb300y/heic0817a.jpg',
    url: '/images/heic0817a/',
    potw: ''
},
"""

# Same anchored pattern as before: capture the quoted url value
selector = r"url\s*:\s*'([^']+)'"
found = re.findall(selector, js_blob)

with open('after_urls.txt', 'w') as fh:
    for item in found:
        fh.write(item + '\n')

print(f"Wrote {len(found)} URLs to after_urls.txt")

Why this matters

Small regex missteps often lead to confusing partial results, especially when scraping semi-structured JavaScript embedded in HTML. Understanding that * repeats the preceding token—and does not mean “anything”—prevents subtle bugs and wasted requests. Anchoring patterns to known keys like url and capturing quoted values is generally safer and more maintainable for this kind of task.

Wrapping up

When parsing inline data blocks, target stable anchors in the text. For this page, extracting the value of url is robust and returns exactly the relative paths you need. If your use case requires only the first segment after /images/, tighten the character class to stop at the next slash. Keep the scraping pipeline simple: fetch, locate the script, apply a precise regex, and reuse the resulting list for downstream requests.