Regex for extracting all complex dates formats from a string in python




I have the following string:


dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"



I want to extract all of the dates in it using a regex. As a first attempt, I wrote the following regex:


import re

regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})'

re.findall(regEx, dateEntries)



I expected this to work, but it returns only a subset of the dates:


A = ['Mar 20, 2009',
'March 20, 2009',
'Mar. 20, 2009',
'Mar 20 2009',
'20 Mar 2009',
'20 March 2009',
'2 Mar. 2009',
'20 March, 2009',
'Mar 20th, 2009',
'Mar 21st, 2009',
'Mar 22nd, 2009',
'Feb 2009',
'Sep 2009',
'Oct 2010']



I don't understand why it does not return these dates:


B = ['04-20-2009', '04/20/09', '4/20/09', '4/3/09', '6/2008', '12/2009', '2009', '2010']



I built regEx by extending r'(?:\d{1,2}[-\s/])?(?:\d{1,2}[-/\s])?(?:\d{2,4})', which works well for set B. But regEx does not produce A+B.



Can anyone help me write a regex that extracts all of the dates in dateEntries?



NOTE: I want to solve this using regex only.





Why do you want to use a regex? For your example you could just use dateEntries.split(";").
– Nils Schlüter
Jul 1 at 10:49





Because my real data is a text file that can contain dates in any of the set A formats, along with other data that is not dates.
– Amit Sharma
Jul 1 at 10:52





FYI, [] matches single characters and character ranges, not strings like th or st. You should replace [] with ()
– barny
Jul 1 at 11:01
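
To make the point about character classes concrete, here is a minimal sketch. The sample text and the names `char_class` and `group` are made up for illustration and do not appear in the thread.

```python
import re

sample = "20dd 20|| 20th"  # hypothetical input, for illustration only

# A character class matches ONE character from its set, so [th|st|nd|rd]
# behaves like [thsndr|]: it also accepts "dd" and even the literal "|".
char_class = re.compile(r'\d{1,2}[th|st|nd|rd]+')

# A non-capturing group with alternation matches whole suffix strings.
group = re.compile(r'\d{1,2}(?:th|st|nd|rd)')

print(char_class.findall(sample))  # ['20dd', '20||', '20th']
print(group.findall(sample))       # ['20th']
```

With the group, only a real ordinal suffix is accepted after the day number.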





Your second non-capturing group should probably be optional
– pkpkpk
Jul 1 at 11:05





Try (?:[\s]?\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,./]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)?(?:\d{2,4})
– pkpkpk
Jul 1 at 11:10




3 Answers



You are just missing a single ? after the (?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) group to make it optional. I also added a + after the last two groups so that the regex does not split a date like "20 March 2009" into two separate dates.
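
To see the effect of the missing ? in isolation, here is a reduced sketch. It uses a deliberately tiny pattern rather than the one from the question, and the names `month`, `required` and `optional` are chosen only for this illustration.

```python
import re

month = r'(?:Jan|Feb|Mar)'  # shortened month list, for illustration

# Without "?", every match must contain a month name.
required = re.compile(rf'{month}\s\d{{4}}')

# With "?", the month part may be absent, so bare numbers match too.
optional = re.compile(rf'(?:{month}\s)?\d{{4}}')

text = "Feb 2009; 2010"
print(required.findall(text))  # ['Feb 2009']
print(optional.findall(text))  # ['Feb 2009', '2010']
```

The same reasoning applies to the full pattern: as long as the month group is mandatory, purely numeric dates can never match.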





The full code:


import re

regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'

dateEntries = "04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; 20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; 6/2008; 12/2009; 2009; 2010"
result = re.findall(regEx, dateEntries)
print(result)



If a date is preceded by whitespace, the match will include that leading whitespace. If you go on to use the date strings, you can remove it with, for example, the .strip() method.
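
A minimal sketch of that clean-up step, using the pattern from this answer with its backslashes (`\d`, `\s`) written out. The shortened `sample` string and the names `raw` and `cleaned` are assumptions made for this illustration.

```python
import re

# The answer's pattern, repeated here so the snippet runs on its own.
regEx = r'(?:\d{1,2}[-/th|st|nd|rd\s]*)?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)?[a-z\s,.]*(?:\d{1,2}[-/th|st|nd|rd)\s,]*)+(?:\d{2,4})+'

sample = "04-20-2009; 04/20/09; Mar 20, 2009"  # shortened input, for illustration

raw = re.findall(regEx, sample)
cleaned = [match.strip() for match in raw]  # drop leading/trailing whitespace

print(raw)      # ['04-20-2009', ' 04/20/09', 'Mar 20, 2009']
print(cleaned)  # ['04-20-2009', '04/20/09', 'Mar 20, 2009']
```

The space in front of the second match comes from the `[a-z\s,.]*` part, which can consume the blank after the semicolon.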







Thanks @Nils. I was also able to extract them using this regex: r'(?:\d{1,2}[-/th|st|nd|rd\s.])?(?:(?:Jan|January|Feb|February|Mar|March|Apr|April|May|Jun|June|Jul|July|August|Sep|September|Oct|October|Nov|November|Dec|December)[\s,.]*)?(?:(?:\d{1,2})[-/th|st|nd|rd\s,.]*)?(?:\d{2,4})'
– Amit Sharma
Jul 1 at 11:22



Try this regex:



^(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?(?:(?:(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)(?:(?:-|/)|(?:,|\.)?\s)?)?(?:\d{1,2}(?:(?:-|/)|(?:th|st|nd|rd)?\s))?)(?:\d{2,4})$



Your regex pattern is very hard to read. Build it from simple building blocks instead; that makes the code a lot more readable.


import re
import calendar

full_months = [month for month in calendar.month_name if month]
short_months = [d[:3] for d in full_months]
months = '|'.join(short_months + full_months)

sep = r'[.,]?\s+'  # separator
day = r'\d+'
year = r'\d+'
day_or_year = r'\d+(?:\w+)?'

r = re.compile(rf'(?:{day}{sep})?(?:{months}){sep}{day_or_year}(?:{sep}{year})?')
r.findall(dateEntries)
# ['Mar 20, 2009', 'March 20, 2009', 'Mar. 20, 2009', 'Mar 20 2009', '20 Mar 2009', '20 March 2009', '2 Mar. 2009', '20 March, 2009', 'Mar 20th, 2009', 'Mar 21st, 2009', 'Mar 22nd, 2009', 'Feb 2009', 'Sep 2009', 'Oct 2010']
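
The pattern above returns only the textual formats (set A). Below is a minimal sketch of extending the same building-block idea to the numeric formats (set B) as well. It is not part of the answer above: the names `patterns` and `date_re`, the list of alternatives and their order are all my own assumptions. It checks only the shape of a date, not whether the date is valid.

```python
import re

dateEntries = ("04-20-2009; 04/20/09; 4/20/09; 4/3/09; Mar 20, 2009; "
               "March 20, 2009; Mar. 20, 2009; Mar 20 2009; 20 Mar 2009; "
               "20 March 2009; 2 Mar. 2009; 20 March, 2009; Mar 20th, 2009; "
               "Mar 21st, 2009; Mar 22nd, 2009; Feb 2009; Sep 2009; Oct 2010; "
               "6/2008; 12/2009; 2009; 2010")

# Building blocks (all names are illustrative).
month = (r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|'
         r'July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|'
         r'Dec(?:ember)?)')
day = r'\d{1,2}'
suffix = r'(?:st|nd|rd|th)?'   # optional ordinal suffix
year = r'(?:\d{4}|\d{2})'      # four-digit or two-digit year
sep = r'[.,]?\s+'              # optional period or comma, then whitespace

# One alternative per date shape, longer shapes first.
patterns = [
    rf'{day}{suffix}\s+{month}{sep}{year}',     # 20 Mar 2009, 2 Mar. 2009
    rf'{month}{sep}{day}{suffix}{sep}{year}',   # Mar 20, 2009, Mar 20th, 2009
    rf'{month}{sep}\d{{4}}',                    # Feb 2009
    rf'{day}[-/]{day}[-/]{year}',               # 04-20-2009, 4/3/09
    rf'{day}/\d{{4}}',                          # 6/2008
    r'\d{4}',                                   # 2009
]

# Word boundaries keep a match from starting or ending inside a longer token.
date_re = re.compile(r'\b(?:' + '|'.join(patterns) + r')\b')

dates = date_re.findall(dateEntries)
print(len(dates))                        # 22
print(dates == dateEntries.split('; '))  # True
```

Because every alternative is a separate, commented line, adding or removing a date shape does not require touching the others. Note that the bare `\d{4}` alternative will also pick up any other four-digit number in real text, so it may need tightening for data that contains more than dates.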





