Regex for extracting all complex dates formats from a string in python

I have the following string:
dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
I want to extract all of the dates in it using a regex. As an attempt, I wrote the following regex:
import re
regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})'
re.findall(regEx, dateEntries)
I expected this to work, but it only returns a subset of the dates:
A = ['Mar 20, 2009',
'March 20, 2009',
'Mar. 20, 2009',
'Mar 20 2009',
'20 Mar 2009',
'20 March 2009',
'2 Mar. 2009',
'20 March, 2009',
'Mar 20th, 2009',
'Mar 21st, 2009',
'Mar 22nd, 2009',
'Feb 2009',
'Sep 2009',
'Oct 2010']
I don't understand why it is not returning these dates:
B = ['04-20-2009', '04/20/09', '4/20/09', '4/3/09', '6/2008', '12/2009', '2009', '2010']
I created regEx by extending r'(?:\d{1,2}[-\s/])?(?:\d{1,2}[-/\s])?(?:\d{2,4})', which works well for set B. But regEx is not able to produce A+B.
Can anyone help me write a regex that extracts all of the dates in dateEntries?
NOTE: I want to solve this using regex only.
Because my real data is a text file in which dates of the set A kind can appear, and the text file has other data apart from dates.
– Amit Sharma
Jul 1 at 10:52
FYI, [] matches single characters and character ranges, not strings like th or st. You should replace [] with ()
– barny
Jul 1 at 11:01
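To illustrate the comment above, here is a minimal sketch comparing a character class with a non-capturing group. The sample string and variable names are made up for the demonstration and are not from the question.

```python
import re

text = "20th 20| 20d 21st"  # hypothetical sample, chosen to expose the difference

# Inside [...], every character stands for itself: t, h, |, s, n, d, r.
# The class therefore matches exactly ONE of those characters.
char_class = re.findall(r'\d{1,2}[th|st|nd|rd]', text)

# A group with alternation matches one whole suffix: th, st, nd or rd.
group = re.findall(r'\d{1,2}(?:th|st|nd|rd)', text)

print(char_class)  # ['20t', '20|', '20d', '21s']
print(group)       # ['20th', '21st']
```

In the question's regex the class is followed by `*`, so it happens to swallow "th" or "st" one character at a time, but it would accept sequences such as "|d" just as readily.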
Your second non-capturing group should probably be optional
– pkpkpk
Jul 1 at 11:05
Try
(?:[\s]?\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,./]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})
– pkpkpk
Jul 1 at 11:10
3 Answers
You are just missing a single ? after the (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) group to mark it as optional. Additionally, I added a + after the last two groups to make sure the regex doesn't split dates like "20 March 2009" into two different dates.
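As a reduced illustration of why the month group has to be optional, the sketch below compares a pattern that requires a month name with one that does not. The two patterns are simplified stand-ins written for this demonstration, not the regex from the answer.

```python
import re

MONTH = r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)'
sample = "Feb 2009; 2010"  # one textual date, one bare year

# Month required: the bare year can never match.
month_required = re.findall(MONTH + r'\s\d{4}', sample)

# Month (and its trailing space) optional: the bare year matches too.
month_optional = re.findall(r'(?:' + MONTH + r'\s)?\d{4}', sample)

print(month_required)  # ['Feb 2009']
print(month_optional)  # ['Feb 2009', '2010']
```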
The full code:
import re
regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'
dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
result = re.findall(regEx, dateEntries)
print(result)
If a date is preceded by whitespace, the match will also include that leading whitespace. If you go on to use the date strings, you can remove it, for example with the .strip() method.
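A minimal sketch of that clean-up step follows. The pattern here is a deliberately tiny stand-in (optional whitespace followed by a four-digit year), used only to produce a match with leading whitespace.

```python
import re

# Hypothetical reduced pattern: \s* lets the match absorb whitespace before the year.
raw = re.findall(r'\s*\d{4}', "2009; 2010")

# str.strip() removes leading and trailing whitespace from each match.
cleaned = [match.strip() for match in raw]

print(raw)      # ['2009', ' 2010']
print(cleaned)  # ['2009', '2010']
```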
Thanks @Nils. I'm also able to extract the dates using this regex:
r'(?:\d{1,2}[-/th|st|nd|rd\s.])?(?:(?:Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|August|Sep|September|Oct|October|Nov|November|Dec|December)[\s,.]*)?(?:(?:\d{1,2})[-/th|st|nd|rd\s,.]*)?(?:\d{2,4})'
– Amit Sharma
Jul 1 at 11:22
Try this regex:
^(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?(?:(?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)(?:(?:-|/)|(?:,|\.)?\s)?)?(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?)(?:\d{2,4})$
Your regex pattern is totally unreadable. Please build it from simple building blocks; that makes the code a lot more readable.
import re
import calendar
full_months = [month for month in calendar.month_name if month]
short_months = [d[:3] for d in full_months]
months = '|'.join(short_months + full_months)
sep = r'[.,]?\s+' # separator
day = r'\d+'
year = r'\d+'
day_or_year = r'\d+(?:\w+)?'
r = re.compile(rf'(?:{day}{sep})?(?:{months}){sep}{day_or_year}(?:{sep}{year})?')
r.findall(dateEntries)
# ['Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009', '20 Mar 2009', '20 March 2009', '2 Mar. 2009', '20 March, 2009', 'Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
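The pattern above covers only the textual dates (set A). One possible way to also pick up the numeric dates (set B) is to add a second building block as an alternative. The sketch below is an assumption, not part of the answer: the `numeric` block is adapted from the set B regex in the question with the whitespace separator dropped, and the sample string is a shortened version of `dateEntries`.

```python
import calendar
import re

full_months = [m for m in calendar.month_name if m]
short_months = [m[:3] for m in full_months]
months = '|'.join(short_months + full_months)
sep = r'[.,]?\s+'  # separator between the parts of a textual date

# Textual dates, e.g. "20 Mar 2009", "Mar 20th, 2009", "Feb 2009".
textual = rf'(?:\d+{sep})?(?:{months}){sep}\d+(?:\w+)?(?:{sep}\d+)?'
# Numeric dates, e.g. "04-20-2009", "4/3/09", "6/2008", "2010" (assumed block).
numeric = r'(?:\d{1,2}[-/])?(?:\d{1,2}[-/])?\d{2,4}'

# The textual alternative is tried first so "20 Mar 2009" is not cut at "20".
combined = re.compile(rf'{textual}|{numeric}')

sample = "04-20-2009; 4/3/09; 20 Mar 2009; Mar 20th, 2009; Feb 2009; 6/2008; 2010"
found = combined.findall(sample)
print(found)
# ['04-20-2009', '4/3/09', '20 Mar 2009', 'Mar 20th, 2009', 'Feb 2009', '6/2008', '2010']
```

Note that the `numeric` block accepts any run of two to four digits, so in a text file with other numbers it will produce false positives unless it is tightened.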
Why do you want to use a regex? For your example you could just use dateEntries.split(";").
– Nils Schlüter
Jul 1 at 10:49
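For reference, the split-based approach suggested in this comment could look like the sketch below. It uses a shortened sample string and only works when the input contains nothing but semicolon-separated dates, which the asker's note says is not the case for the real data.

```python
# Shortened, hypothetical sample; the real dateEntries string is longer.
date_entries = "04-20-2009; 04/20/09; Mar 20, 2009; 2010"

# Split on the delimiter, then strip the space that follows each semicolon.
dates = [part.strip() for part in date_entries.split(";")]

print(dates)  # ['04-20-2009', '04/20/09', 'Mar 20, 2009', '2010']
```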