CVE-2024-0243

With the following crawler configuration:

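from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders.recursive_url_loader import RecursiveUrlLoader

# Crawl the base URL and the pages it links to, up to two levels deep
url = "https://example.com"
loader = RecursiveUrlLoader(
    url=url, max_depth=2, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()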

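An attacker who controls the content served at https://example.com could place a malicious HTML file there containing links such as https://example.completely.different/my_file.html, and the crawler would download that file as well, even though prevent_outside=True. Note that the example link begins with the same characters as the base URL (https://example.com), so a check that compares URL strings by prefix rather than by host would treat it as an internal link.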
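The difference between a string-prefix check and a host comparison can be illustrated with a minimal sketch. The function names is_outside_prefix and is_outside_host are hypothetical and are not taken from the LangChain code base or from the fix; the sketch only assumes that the outside-link test behaves like a prefix comparison, as the example link suggests.

```python
from urllib.parse import urlparse


def is_outside_prefix(base_url: str, link: str) -> bool:
    """Hypothetical naive check: a link is 'outside' unless it starts with the base URL string."""
    return not link.startswith(base_url)


def is_outside_host(base_url: str, link: str) -> bool:
    """Hypothetical stricter check: compare scheme and network location, not raw strings."""
    base, target = urlparse(base_url), urlparse(link)
    return (base.scheme, base.netloc.lower()) != (target.scheme, target.netloc.lower())


base = "https://example.com"
malicious = "https://example.completely.different/my_file.html"

# The prefix check lets the attacker's link through; the host check rejects it.
prefix_verdict = is_outside_prefix(base, malicious)
host_verdict = is_outside_host(base, malicious)
```

Under these assumptions, the prefix check classifies the attacker's link as internal, while the host comparison classifies it as external and still accepts pages such as https://example.com/docs/page.html.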

https://github.com/langchain-ai/langchain/blob/bf0b3cc0b5ade1fb95a5b1b6fa260e99064c2e22/libs/community/langchain_community/document_loaders/recursive_url_loader.py#L51-L51

Resolved in https://github.com/langchain-ai/langchain/pull/15559

Weakness

The web server receives a URL or similar request from an upstream component and retrieves the contents of this URL, but it does not sufficiently ensure that the request is being sent to the expected destination.

Affected Software

Name	Vendor	Start Version	End Version
Langchain	Langchain	*	0.1.0 (excluding)
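
The version range above can be checked mechanically. The following is a rough sketch, assuming plain dotted release numbers (no pre-release suffixes such as rc1); the helper names parse_version and is_affected are hypothetical.

```python
def parse_version(version: str) -> tuple:
    """Turn a dotted release string such as '0.0.354' into a tuple of at least three ints."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))  # pad '0.1' to (0, 1, 0) so comparisons line up
    return tuple(parts)


# First version outside the affected range, per the table above (end version is excluding)
FIXED_IN = parse_version("0.1.0")


def is_affected(version: str) -> bool:
    """Return True when the given version falls before the fixed release."""
    return parse_version(version) < FIXED_IN
```

The installed version string could be obtained with importlib.metadata.version("langchain") and passed to is_affected.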

https://cwe.mitre.org/data/definitions/664.html

NVD	https://nvd.nist.gov/vuln/detail/CVE-2024-0243
CWE	https://cwe.mitre.org/data/definitions/918.html

Server-Side Request Forgery (SSRF)
