These bash one-liners served me well when exploring raw HTML and XML. Once I started building an ingestion/cleaning pipeline, I output more detailed statistics about tag usage (a counting variant is sketched after the XML example below).
For HTML, you may want to use a looser regex like this one. The regex searches for anything starting with < and allows the tag not to be closed.
ag "</?([^> ]*)?>?" path/to/html/ -io \
| sed -n -E -e 's/^.*:<?\/?//p' \
| sort --unique
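Because the match stops at the first space or >, the output can keep a stray trailing > (e.g. p> from <p> or div> from </div>). If you want bare tag names, a capture group is one way to do it; this is only a sketch, assuming GNU or BSD sed with -E:

# keep only the tag name: drop the line-number prefix, the leading < or </, and any trailing >
ag "</?([^> ]*)?>?" path/to/html/ -io \
| sed -n -E -e 's/^.*:<\/?([^>]*)>?$/\1/p' \
| sort --unique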
For XML, finding the end tags (e.g. </EndTag>) may be sufficient. I didn't consider start tags because they can have attributes (e.g. <StartTag Attribute=X>). This assumes the XML is already decently formatted.
ag "</.*?>" path/to/xml/ -oi \
| sed -n -e 's/^.*://p' \
| sort --unique
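For the tag-usage statistics mentioned at the top, swapping sort --unique for a count is enough. A minimal sketch on the XML variant, using the standard uniq -c / sort -rn idiom (path/to/xml/ is a placeholder, as above):

# tally how often each end tag appears, most frequent first
ag "</.*?>" path/to/xml/ -oi \
| sed -n -e 's/^.*://p' \
| sort \
| uniq -c \
| sort -rn

uniq -c prefixes each line with its count, and sort -rn puts the most common tags first.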
Notes:
ag is the Silver Searcher
sed removes the line-number prefix from each result (e.g. ag returns 123:<br />); in the HTML version it also strips the leading < and optional /.
sort --unique sorts the lines and discards duplicates