Add parser for flawfinder#31
Conversation
firehose/parsers/flawfinder.py
Outdated
| weakness_messages.append(message) | ||
| # TODO: Flawfinder can returns more than one CWE, | ||
| # when this rappends, get_cwe cannot return a valid value | ||
| # Fix this. |
There was a problem hiding this comment.
Do we need to support more than one CWE per issue in the schema? Currently <cwe> is optional within <issue>; should it be zeroOrMore?
There was a problem hiding this comment.
I was going to ask this. I don't know what the best pratice in this case. When flawfinder returns more than one CWE, what should i do?
There was a problem hiding this comment.
Can you post some examples of what the output looks like?
My current opinion is that we should change the schema and model.py to support multiple CWEs per issue.
There was a problem hiding this comment.
./docs/examples/rtsp.c:170: [4] (buffer) sscanf:
The scanf() family's %s operation, without a limit specification, permits
buffer overflows (CWE-120, CWE-20). Specify a limit to %s, or use a
different input function.
./docs/examples/synctime.c:155: [4] (buffer) sscanf:
The scanf() family's %s operation, without a limit specification, permits
buffer overflows (CWE-120, CWE-20). Specify a limit to %s, or use a
different input function.
In my opinion, Add multiple CWEs would generate a more realistic output, once one warning could be related to more than one CWE.
There was a problem hiding this comment.
I can send another PR doing this.
There was a problem hiding this comment.
Thanks. I'll update the schema accordingly.
There was a problem hiding this comment.
I will first make this change in models, then i finish the flawfinder parser.
There was a problem hiding this comment.
@davidmalcolm I have a doubt. Projects that already use firehose, and pass to Issue model a single cwe using a integer value. Should we maintain the compatibility? (Allows de user to pass a integer or a list of integers). Or we keep only a list of integers?
There was a problem hiding this comment.
This change impacts the others parsers too.
76666dd to
d10b185
Compare
|
Hey @davidmalcolm , i have updated this PR. I added some tests, and have included flawfinder in the documentation. Now the parser capture the first CWE, as we had discussed. I created the issue #35, to map the multiple cwes problem. |
davidmalcolm
left a comment
There was a problem hiding this comment.
Thanks; this is looking good. I have some concerns about version-handling, and would prefer it if the parser captured the severity of issues (and maybe categories also).
Sorry if this seems nit-picky.
firehose/parsers/flawfinder.py
Outdated
| """ | ||
|
|
||
| generator = Generator(name='flawfinder', | ||
| version=flawfinder()) |
There was a problem hiding this comment.
Note that a Generator's version can be None. It looks like the actual version is embedded in the first line of the output:
Flawfinder version 1.31, (C) 2001-2014 David A. Wheeler.
so presumably parse_file should be looking for lines like that and parsing them.
There was a problem hiding this comment.
Fixed. Now i get the version from the Flawfinder output.
firehose/parsers/flawfinder.py
Outdated
| def get_cwe(line): | ||
| """Extract CWE from flawfinder alarms | ||
| It's possible that a alarm not have a correlated CWE | ||
|
|
There was a problem hiding this comment.
Should have a comment here referencing issue #35, and that it's just the first CWE if there's more than one.
firehose/parsers/flawfinder.py
Outdated
| try: | ||
| return check_output(["flawfinder", "--version"]).strip() | ||
| except Exception: | ||
| return '1.31' |
There was a problem hiding this comment.
As noted above, I'd much prefer it if parse_file parsed the version header in the output.
If the output file doesn't contain a version string, then I think it's better for the parser to leave the Generator's version as None, rather than trying to scrape it from flawfinder --version (or guessing). Rationale: I use firehose in mock-with-analysis, which does the analysis in a chroot. There might be a different version of flawfinder in the chroot compared to what's in the host $PATH.
So I think it's better to get rid of this function and do the version-handling in parse_file.
firehose/parsers/flawfinder.py
Outdated
| line = loaded_file.readline() | ||
|
|
||
| counter = 0 | ||
| for weakness in weakness_paths: |
There was a problem hiding this comment.
Rather than manually incrementing counter, could this code use Python's enumerate builtin? (and maybe use idx rather then counter?)
firehose/parsers/flawfinder.py
Outdated
|
|
||
| issue = Issue(cwe, None, location, | ||
| Message(text=weakness_messages[counter]), notes=None, | ||
| trace=None, severity=None, customfields=None) |
There was a problem hiding this comment.
Re severity: Flawfinder's docs say:
The risk level is shown inside square brackets and varies from 0, very little risk, to 5, great risk.
It would be good to capture that number as Issue.severity (as a string, without the square brackets):
http://firehose.readthedocs.io/en/latest/data-model.html#firehose.model.severity
firehose/parsers/flawfinder.py
Outdated
| else: | ||
| cwe = int(weakness_cwes[counter]) | ||
|
|
||
| issue = Issue(cwe, None, location, |
There was a problem hiding this comment.
The None here is the Issue.testid:
http://firehose.readthedocs.io/en/latest/data-model.html#firehose.model.testid
Perhaps the (shell) system: stuff should be captured here. If I'm reading the Flawfinder docs correctly:
Each entry has a filename, a colon, a line number, a
risk level in brackets (where 5 is the most risky), a category, the name of the function, and a description of
....then Flawfinder's "category" here is "shell" and the name of the function is "system". I'm not yet sure how that should map to a firehose Issue.testid.
There was a problem hiding this comment.
@davidmalcolm how should we handle this? Apprently flawfinder does not have a testid value.
There was a problem hiding this comment.
Can we use flawfinder's "category" here for the FIrehose "testid"? Looking at the sample output, that would give values like "shell", "format", "buffer", "race", which seems sane.
|
Hey @davidmalcolm , i updated this PR with the last revision. |
davidmalcolm
left a comment
There was a problem hiding this comment.
Thanks for the updates.
I've got a few more nitpicks (sorry), covering code I had trouble following.
firehose/parsers/flawfinder.py
Outdated
|
|
||
| :arg1: file passed by cli | ||
| :returns: file loaded in memory | ||
| """ |
There was a problem hiding this comment.
This doesn't actually load a file into memory; it merely returns a file object representing a file opened for reading. The code would be clearer if you got rid of this function and replace users with just the open builtin.
firehose/parsers/flawfinder.py
Outdated
| """ Parser loaded flawfinder output | ||
|
|
||
| :loaded_file: flawfinder report, loaded in memory | ||
| :returns: Firehose Analysis object, representing the final XML. |
There was a problem hiding this comment.
"loaded_file" is actually a file-like object, not an in-memory string. Please rename to "infile", or similar ("input" is a builtint IIRC)
firehose/parsers/flawfinder.py
Outdated
| data["message"] = message; | ||
| line = message_line | ||
| flawfinder_data.append(data) | ||
| else: |
There was a problem hiding this comment.
I'm finding it hard to follow this two-stage parsing. If I'm reading things right, the code first reads the lines, building up a list of dicts (flawfinder_data). There's then a second loop over that list, creating Issue objects and adding them to analysis.results.
Wouldn't it be much simpler to have one loop, and append Issue objects directly to analysis.results, without the data dict i.e. work directly with an Issue object instead.
Hope that makes sense; I may have misread this code.
firehose/parsers/flawfinder.py
Outdated
| int(line[cwe_first_int_char]) | ||
| cwe_first_int_char += 1 | ||
| except Exception: | ||
| cwe = line[(cwe_init_str_pos + 4):(cwe_first_int_char)] |
There was a problem hiding this comment.
Wouldn't this be much simpler with a regex e.g.
re.findall(line, "CWE-([0-9]+)")
or somesuch? (which I think would give a list of tuples of strings).
Could then rename the function to find_cwes and have it return a list of ints, and do the dropping of multiples in the caller.
Can also write a unit test for the CWE parser that way.
|
|
||
|
|
||
| def flawfinder(first_line): | ||
| """Retrieve flawfinder version from report. |
There was a problem hiding this comment.
I find the current name to be unclear; please rename to parse_first_line (as that's what it does).
There was a problem hiding this comment.
Maybe rename to get_flawfinder_version should be better, what do you think @davidmalcolm ?
|
|
||
| class TestParseXml(unittest.TestCase): | ||
| def parse_example(self, filename): | ||
| report = load_file(os.path.join(os.path.dirname(__file__), |
There was a problem hiding this comment.
Given that it's a file-like object, maybe turn into:
path = os.path.join(...)
with open(path) as f:
return parse_file (f)
863d9a5 to
8bf2e8a
Compare
|
@davidmalcolm thanks for the great revision, i had updated the PR. |
4c64931 to
65ac128
Compare
davidmalcolm
left a comment
There was a problem hiding this comment.
Thanks for the updated version.
This is looking much better, but I have a few more nitpicks (sorry).
firehose/parsers/flawfinder.py
Outdated
| arg_file = sys.argv[1] | ||
| report = open(arg_file, 'r') | ||
| except IOError: | ||
| print("File note found") |
There was a problem hiding this comment.
Typo: "note" should be "not".
But an IOError is not necessarily "file not found"; it could be a permissions error, or an attempt to open a directory rather than a file, etc. Maybe just lose this clause?
firehose/parsers/flawfinder.py
Outdated
| (prog.search(message_line) and prog2.search(message_line)): | ||
| issue_message += " " + message_line | ||
| try: | ||
| cwe = re.findall("CWE-([0-9]+)", message_line)[0] |
There was a problem hiding this comment.
I don't like using try/except for parsing; I prefer to use non-exception program flow (rationale: sometimes we end up catching a different exception than the one we thought we were, so it can conceal bugs).
Anything that matches the ([0-9]+) is known to be convertible to int, so how about:
# list of ints
cwes = []
before the "while message_line" loop, and:
cwes.extend([int(m.group(0))
for m in re.findall("CWE-([0-9]+)", message_line)])
here, and then after the "message_line" loop:
if cwes:
first_cwe = cwes[0]
else
first_cwe = None
to avoid the two try/except clauses.
Please reinstate the comment about issue #35
firehose/parsers/flawfinder.py
Outdated
| try: | ||
| testid = re.findall(r"\(([a-z]+)\)", line)[0] | ||
| except IndexError: | ||
| pass |
There was a problem hiding this comment.
If we just want the first match, just use re.search, which returns None for the "not-found" case ( https://docs.python.org/3.4/library/re.html#re.search ).
|
@davidmalcolm I had updated the PR |
davidmalcolm
left a comment
There was a problem hiding this comment.
Thanks for your improvements here.
Ironically, with the code cleanups you've done, you've simplified things enough that I can now see a couple of other issues that were hiding.
| testid = re.search("\(([a-z]+)\)", line).group(1) | ||
| issue_path = line.split(":")[0] | ||
| issue_line = line.split(":")[1] | ||
|
|
There was a problem hiding this comment.
Thanks for doing the earlier cleanups; the code is clearer now.
With that, the handling of prog, prog2 and the testid looks a bit unusual; presumably these always appear on the same line, in a consistent way. Given that, I think it would be improved by using a single regular expression; something like:
# We want to match a line like:
# ./docs/examples/asiohiper.cpp:78: [4] (shell) system:
filename_group = "(.+)"
linenum_group = "([0-9]+)
severity_group = "\[([0-9])\]"
category_group = "\((.+)\)"
funcname_group = "(\S+)"
ws = '\s+'
PAT= (filename_group + ':' + linenum_group + ':' + ws + severity_group + ws
+ category_group + ws + funcname_group)
It should then be possible to do:
m = re.match(PAT, message_line)
if m:
# do stuff with m.groups()
firehose/parsers/flawfinder.py
Outdated
|
|
||
| issue = Issue(first_cwe, testid, location, | ||
| Message(text=issue_message), notes=None, | ||
| trace=None, severity=None, customfields=None) |
There was a problem hiding this comment.
Please set the severity field to the number in square parens (keep it as a string, rather than calling int on it).
| cwes = [] | ||
| while message_line != "\n" and not \ | ||
| (prog.search(message_line) and prog2.search(message_line)): | ||
| issue_message += " " + message_line |
There was a problem hiding this comment.
You could reuse the PAT from above here.
- Add only one cwe in Firehose report - Retrieve flawfinder version from report. - Use enumerate instead of counter. - Add tests - Add comment about multiple cwes. - Fix regex.
|
@davidmalcolm I had updated the PR |
davidmalcolm
left a comment
There was a problem hiding this comment.
Thanks for all the updates; this looks good.
I'm still working on this pull request (missing tests). soon i will remove the 'WIP' tag.