Discovery¶
Indieweb utils provides a number of functions to help you determine properties from webpages.
Discover IndieWeb endpoints¶
The discover_endpoints() function parses HTTP Link headers and HTML <link> tags to find the specified endpoints.
- indieweb_utils.discover_endpoints(url: str, headers_to_find: List[str], request: Response | None = None)[source]¶
Return a dictionary of specified endpoint locations for the given URL, if available.
- Parameters:
- Returns:
The discovered endpoints.
- Return type:
import indieweb_utils import requests url = "https://jamesg.blog/" # find the microsub rel link on a web page headers_to_find = ["microsub"] endpoints = indieweb_utils.discover_endpoints( url ) print(endpoints) # {'microsub': 'https://aperture.p3k.io/'}
- Raises:
requests.exceptions.RequestException – Error raised while making the network request to discover endpoints.
This function only returns the specified endpoints if they can be found. It does not perform any validation to check that the discovered endpoints are valid URLs.
We recommend using the discover_webmention_endpoint function to discover webmention endpoints as this performs additional validation useful in webmention endpoint discovery.
Find a post type¶
To find the post type associated with a web page, you can use the get_post_type function.
The get_post_type function function uses the following syntax:
- indieweb_utils.get_post_type(h_entry: dict = {}, custom_properties: List[Tuple[str, str]] = []) str [source]¶
Return the type of a h-entry per the Post Type Discovery algorithm.
- Parameters:
- Returns:
The type of the h-entry.
- Return type:
Here is an example of the function in action:
import indieweb_utils import mf2py url = "https://jamesg.blog/2022/01/28/integrated-indieweb-services/" parsed_mf2 = mf2py.parse(url=url) h_entry = [e for e in parsed_mf2["items"] if e["type"] == ["h-entry"]][0] post_type = indieweb_utils.get_post_type( h_entry ) print(post_type) # article
- Raises:
PostTypeFormattingError – Raised when you specify a custom_properties tuple in the wrong format.
This function returns a single string with the post type of the specified web page.
See the Post Type Discovery specification for a full list of post types.
Custom Properties¶
The structure of the custom properties tuple is:
(attribute_to_look_for, value_to_return)
An example custom property value is:
custom_properties = [
("poke-of", "poke")
]
This function would look for a poke-of attribute on a web page and return the “poke” value.
By default, this function contains all of the attributes in the Post Type Discovery mechanism.
Custom properties are added to the end of the post type discovery list, just before the “article” property. All specification property types are checked before your custom attribute.
Find the original version of a post¶
To find the original version of a post per the Original Post Discovery algorithm, use this code:
- indieweb_utils.discover_original_post(posse_permalink: str, soup: BeautifulSoup | None = None, html: str = '') str [source]¶
Find the original version of a post per the Original Post Discovery algorithm.
refs: https://indieweb.org/original-post-discovery#Algorithm
- Parameters:
posse_permalink (str) – The permalink of the post.
- Returns:
The original post permalink.
- Return type:
Example:
import indieweb_utils original_post_url = indieweb_utils.discover_original_post("https://example.com") print(original_post_url)
- Raises:
PostDiscoveryError – A candidate URL cannot be retrieved or when a specified post is not marked up with h-entry.
This function returns the URL of the original version of a post, if one is found. Otherwise, None is returned.
Find all feeds on a page¶
To discover the feeds on a page, use this function:
- indieweb_utils.discover_web_page_feeds(url: str, user_mime_types: List[str] | None = None, html: str = '') List[FeedUrl] [source]¶
Get all feeds on a web page.
- Parameters:
- Returns:
A list of FeedUrl objects.
- Return type:
List[FeedUrl]
Example:
import indieweb_utils url = "https://jamesg.blog/" feeds = indieweb_utils.discover_web_page_feeds(url) # print the url of all feeds to the console for f in feeds: print(f.url)
This function returns a list with all feeds on a page.
Each feed is structured as a FeedUrl object. FeedUrl objects contain the following attributes:
Get a Representative h-card¶
To find the h-card that is considered representative of a web resource per the Representative h-card Parsing Algorithm, use the following function:
- indieweb_utils.get_representative_h_card(url: str, html: str = '', parsed_mf2: Parser | None = None) Dict[str, Any] [source]¶
Get the representative h-card on a page per the Representative h-card Parsing algorithm.
refs: https://microformats.org/wiki/representative-h-card-parsing
- Url:
The url to parse.
- Returns:
The representative h-card.
- Return type:
Example:
import indieweb_utils url = "https://jamesg.blog/" h_card = indieweb_utils.get_representative_h_card(url) print(h_card) # {'type': ['h-card'], 'properties': {...}}
- Raises:
RepresentativeHCardParsingError – Representative h-card could not be parsed.
This function returns a dictionary with the h-card found on a web page.
Get a Page h-feed¶
The discover_h_feed() function implements the proposed microformats2 h-feed discovery algorithm.
This function looks for a h-feed on a given page. If one is not found, the function looks for a rel tag to a h-feed. If one is found, that document is parsed.
If a h-feed is found on the related document, the h-feed is returned.
This function returns a dictionary with the h-card found on a web page.
- indieweb_utils.discover_h_feed(url: str, html: str = '') Dict [source]¶
Find the main h-feed that represents a web page as per the h-feed Discovery algorithm.
refs: https://microformats.org/wiki/h-feed#Discovery
- Parameters:
- Returns:
The h-feed data.
- Return type:
Example:
import indieweb_utils url = "https://jamesg.blog/" hfeed = indieweb_utils.discover_h_feed(url) print(hfeed)
Infer the Name of a Page¶
To find the name of a page per the Page Name Discovery Algorithm, use this function:
- indieweb_utils.get_page_name(url: str, html: str | None = None, soup: BeautifulSoup | None = None) str [source]¶
Retrieve the name of a page using the Page Name Discovery algorithm.
- Refs:
- Parameters:
- Returns:
A representative “name” for the page.
- Return type:
Example:
import indieweb_utils page_name = indieweb_utils.get_page_name("https://jamesg.blog") print(page_name) # "Home | James' Coffee Blog"
This function searches:
For a h-entry title. If one is found, it is returned;
For a h-entry summary. If one is found, it is returned;
For a HTML page <title> tag. If one is found, it is returned;
Otherwise, this function returns “Untitled”.
Get all URLs a Post Replies To¶
To find all of the URLs to which a reply post is replying, use this function:
- indieweb_utils.get_reply_urls(url: str, html: str | None = None) List[str] [source]¶
Retrieve a list of all of the URLs to which a given post is responding using a u-in-reply-to microformat.
- Refs:
- Parameters:
- Returns:
A list of all of the URLs to which the given post responds.
- Return type:
Example:
import indieweb_utils reply_urls = indieweb_utils.get_reply_urls("https://aaronparecki.com/2022/10/10/17/") print(reply_urls) # ["https://twitter.com/amandaljudkins/status/1579680989135384576?s=12"]