Assertions are explicit statements of what you assume your program will have to deal with. An assertion prevents the impossible from being asked of your code (at least the impossible things you can think of!).
The assert statement...
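As a minimal sketch of how the statement behaves: `assert condition, message` does nothing when the condition is true and raises an AssertionError when it is false. The function name `average` and its message text are made up for this illustration and do not appear elsewhere in this section.

```python
def average(values):
    # Precondition: an empty list would make the division below impossible.
    assert len(values) > 0, "average() needs at least one value"
    return sum(values) / len(values)


if __name__ == "__main__":
    print(average([1, 2, 3]))  # the assertion holds, so the mean is printed
    try:
        average([])  # the assertion fails here
    except AssertionError as err:
        print("AssertionError:", err)
```

Note that Python skips assert statements entirely when run with the -O option, so they are a debugging aid rather than a substitute for input validation.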
Our hypothesis is that the error is a result of the variable tag being set, so we can use assert to raise an exception if tag should ever be set.
#!/usr/bin/python3
def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        assert not tag  # NEW
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c
    return out

# We know this failed
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
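The assertion never fails for this input, so tag is not being set and the hypothesis is refuted. That leaves quote. Here again is the only place where quotes are handled: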
elif c == '"' or c == "'" and tag:  # Quote
    quote = not quote
We can test whether quote is ever being set with another assert:
#!/usr/bin/python3
def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        assert not quote
        # assert quote
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c
    return out

# We know this failed
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
We find that the assertion raises an exception, so we know that we are entering the block where the quote variable is changed. An assert False placed inside that block confirms it:
#!/usr/bin/python3
def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote
            assert False  # Should never be reached (NEW)
            quote = not quote
        elif not tag:
            out = out + c
    return out

# We know this failed
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
That is, we know that the quote branch is executed even though tag is False.
Is the problem with quotes general? Are single quotes stripped in the same way?
We modify our test code to test this hypothesis:
#!/usr/bin/python3
def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif c == '"' or c == "'" and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c
    return out

# Our tests
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
    print(removeHtmlMarkup("'foo'"), "\t['foo']")  # NEW TEST
The condition
elif c == '"' or c == "'" and tag:  # Quote
    quote = not quote
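is true for every double quote, whether or not tag is set, but true for a single quote only when tag is set.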
We now have enough information to see exactly what is going on.
'and' takes precedence over 'or'. Consequently the code:
elif c == '"' or c == "'" and tag:  # Quote
    quote = not quote
is equivalent to:
elif c == '"' or (c == "'" and tag):  # Quote
    quote = not quote
when what we wanted was:
elif (c == '"' or c == "'") and tag:  # Quote
    quote = not quote
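The difference between the two groupings can be checked directly. In this small sketch the helper names `as_written` and `as_intended` are invented for the illustration; each one evaluates one of the two conditions for a given character and tag state.

```python
def as_written(c, tag):
    # How Python parses the original condition: 'and' binds tighter than 'or'.
    return c == '"' or (c == "'" and tag)


def as_intended(c, tag):
    # The grouping we actually wanted.
    return (c == '"' or c == "'") and tag


if __name__ == "__main__":
    # Print every combination of quote character and tag state.
    for c in ('"', "'"):
        for tag in (False, True):
            print(repr(c), tag, as_written(c, tag), as_intended(c, tag))
```

The two conditions disagree only for a double quote outside a tag, which is exactly the input that failed.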
The corrected program, with the old tests added back in:
#!/usr/bin/python3
def removeHtmlMarkup(s):
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif (c == '"' or c == "'") and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c
    return out

# Our tests
if __name__ == "__main__":
    print(removeHtmlMarkup('"foo"'), '\t["foo"]')
    print(removeHtmlMarkup("'foo'"), "\t['foo']")
    # Old tests
    print("Old tests...")
    print(removeHtmlMarkup('<b>foo</b>'), '\t[foo]')
    print(removeHtmlMarkup('<em>foo</em>'), '\t[foo]')
    print(removeHtmlMarkup('<a href="foo.html">foo</a>'), '\t[foo]')
    print(removeHtmlMarkup('<a href="">foo</a>'), '\t[foo]')
    print(removeHtmlMarkup('<a href=">">foo</a>'), '\t[foo]')
    print(removeHtmlMarkup('<b>foo</b>'), '\t[foo]')
    print(removeHtmlMarkup('<b>"foo"</b>'), '\t["foo"]')
    print(removeHtmlMarkup('"<b>foo</b>"'), '\t["foo"]')
    print(removeHtmlMarkup('<"b">foo</"b">'), '\t[foo]')
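The print-based tests above rely on a reader comparing the two columns by eye. In keeping with the theme of this section, the same checks could be written as assertions so that a regression fails loudly. This is only a sketch: the name `CASES` is introduced here for illustration, the list holds a selection of the inputs above, and the function is repeated so the block runs on its own.

```python
def removeHtmlMarkup(s):
    # Same corrected function as above, repeated so this sketch is standalone.
    tag = False
    quote = False
    out = ""
    for c in s:
        if c == '<' and not quote:  # Start of markup
            tag = True
        elif c == '>' and not quote:  # End of markup
            tag = False
        elif (c == '"' or c == "'") and tag:  # Quote
            quote = not quote
        elif not tag:
            out = out + c
    return out


# (input, expected output) pairs taken from the print-based tests.
CASES = [
    ('"foo"', '"foo"'),
    ("'foo'", "'foo'"),
    ('<b>foo</b>', 'foo'),
    ('<a href=">">foo</a>', 'foo'),
    ('<b>"foo"</b>', '"foo"'),
    ('"<b>foo</b>"', '"foo"'),
    ('<"b">foo</"b">', 'foo'),
]

if __name__ == "__main__":
    for source, expected in CASES:
        actual = removeHtmlMarkup(source)
        # On failure, the message shows the input, the result and the expectation.
        assert actual == expected, (source, actual, expected)
    print("all", len(CASES), "cases pass")
```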
What if we want to keep tags that appear inside quotes? For example,
<b>We want to keep "<thistag>"</b>
should give
We want to keep "<thistag>"
We would need a state machine with four states:
States: no-tag,no-quote / tag,no-quote / tag,quote / no-tag,quote
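One possible shape for such a state machine is sketched below. It is an illustration under stated assumptions, not a finished design: the names `State` and `removeHtmlMarkupKeepQuoted` are invented here, either quote character is allowed to close a quote opened by the other (as in the code above), and any quote character in plain text, including an apostrophe, opens a quoted region.

```python
from enum import Enum


class State(Enum):
    TEXT = "no-tag,no-quote"
    TAG = "tag,no-quote"
    TAG_QUOTE = "tag,quote"
    TEXT_QUOTE = "no-tag,quote"


def removeHtmlMarkupKeepQuoted(s):
    state = State.TEXT
    out = ""
    for c in s:
        is_quote = c == '"' or c == "'"
        if state is State.TEXT:
            if c == '<':
                state = State.TAG  # Start of markup: drop the '<'
                continue
            if is_quote:
                state = State.TEXT_QUOTE
            out = out + c
        elif state is State.TEXT_QUOTE:
            if is_quote:
                state = State.TEXT
            out = out + c  # Keep everything, including '<' and '>'
        elif state is State.TAG:
            if c == '>':
                state = State.TEXT  # End of markup
            elif is_quote:
                state = State.TAG_QUOTE
        else:  # State.TAG_QUOTE: ignore everything up to the closing quote
            if is_quote:
                state = State.TAG
    return out


if __name__ == "__main__":
    print(removeHtmlMarkupKeepQuoted('<b>We want to keep "<thistag>"</b>'))
```

With this behaviour the earlier test input '"<b>foo</b>"' is returned unchanged instead of becoming "foo" in quotes, so the expected results of the old tests would have to be revisited if this requirement were adopted.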