(.+?)", re.DOTALL | re.M) tag_re = re.compile(r"<.+?>") punct_re = re.compile(r"\W") # Repairs; regex use is order dependent! s = s.lower() s = pre_re.search(s).group(1) s = tag_re.sub("", s) s = punct_re.sub("", s) return palindrome_detector(s) #--> """=================================================================== 4. [4 points] Cristian Danescu-Niculescu-Mizil released a large corpus of U.S. Supreme Court Dialogues: https://confluence.cornell.edu/display/llresearch/Supreme+Court+Dialogs+Corpus For this problem, download and unpack the corpus, and open up the file supreme.conversations.txt so that you can get a feel for its format. Your goal is to produce a dictionary mapping each justice name to the number of words he or she speaks in the corpus. For this, you need to write three functions: a. supremes_tokenizer: a tokenizer (you can reuse one you wrote previously). b. supremes_iterator: an iterator function that goes through supreme.conversations.txt line-by-line, yielding a dict with this structure: CASE_ID (str): unique id of the case UTTERANCE_ID (int): unique id of the utterance AFTER_PREVIOUS (bool): True or False (the Python booleans) SPEAKER (str): from the file IS_JUSTICE in {JUSTICE, NOT JUSTICE} (bool): True if file value is 'JUSTICE', else False JUSTICE_VOTE (str): the string from the file or None (the Python object) if the str is 'NA' PRESENTATION_SIDE (str): the string from the file UTTERANCE (str): the text produced by the speaker WORDS (list of strs): UTTERANCE tokenized according to your tokenizer c. 
supremes_wordcounts: the function that uses supremes_tokenizer and supremes_iterator to produce the count dictionary.""" def supremes_tokenizer(s): #<-- return s.lower().strip().split() #--> def supremes_iterator(filename): # Danescu's custom field separator: sep = " +++$+++ " #<-- for line in open(filename): vals = line.strip().split(sep) yield {'CASE_ID': vals[0], 'UTTERANCE_ID': int(vals[1]), 'AFTER_PREVIOUS': True if vals[2] == 'TRUE' else False, 'SPEAKER': vals[3], 'IS_JUSTICE': True if vals[4] == 'JUSTICE' else False, 'JUSTICE_VOTE': vals[5], 'PRESENTATION_SIDE': vals[6], 'UTTERANCE': vals[7], 'WORDS': supremes_tokenizer(vals[7])} #--> def supremes_wordcounts(filename): # Your output dictionary: counts = {} # Here, you use supremes_iterator on your filename: for d in supremes_iterator(filename): # Gather the word counts by justice (ignoring non justice speakers): #<-- if d['IS_JUSTICE']: spk = d['SPEAKER'] counts[spk] = counts.get(spk, 0) + len(d['WORDS']) #--> return counts """=================================================================== 5. [4 points] The G-test is a statistical test that is similar to the chi-square test. The Wikipedia page is a solid introduction: http://en.wikipedia.org/wiki/G-test Your task is to implement the G-test. I've supplied the code for testing statistical significance, so you don't have to write that. For it to run, you do need to install scipy; see the course website's 'Resources' page for links. """ def gtest(observed): """The input is a list of lists (or a numpy array, if you prefer). 
The output is a dictionary providing statistical info.""" ## Import the needed numpy and scipy libraries: import numpy as np from scipy.stats import chi2 ## Try to ensure the right input type: observed = np.asarray(observed) ## In here, you need to calculate the g-statistic and ## the matrix of expected values: #<-- # Expectations: csum = np.sum(observed, 0) rsum = np.sum(observed, 1) total = float(np.sum(observed)) nrow, ncol = observed.shape expected = np.outer(rsum, csum) / total g = 2.0 * np.sum(observed * np.log(observed / expected)) #--> # Significance: ## Degree of freedom based on the dimensions of observed (nrow is ## the number of rows in observed, ncol the number of columns). df = (nrow - 1) * (ncol - 1) ## g is your primary test statistic, calculated with your own code: p = 1.0 - chi2.cdf(g, df) # Stats dictionary to return: stats = {} # Store the input matrix for the user: stats['observed'] = observed # This is a matrix of the same dimension as observed: stats['expected'] = expected # Your g-statistic: stats['G'] = g # The p-value: stats['p'] = p return stats """=================================================================== 6. [4 points] This problem asks you to identify the main connective in formulae of propositional logic. The input is a formula, as a str, and the output is a connective. The recursive definition of 'main connective' is as follows: a. For any atomic letter, the main connective is None b. Negation: where F is any formula, the main connective of ~F is ~ c. Conjunction: where F and G are formulae, the main connective in (F & G) is & d. Disjunction: where F and G are formulae, the main connective in (F | G) is | e. Conditional: where F and G are formulae, the main connective in (F > G) is > The definition is recursive in the sense that F and G can be formulae of arbitrary complexity. 
Examples: main_connective('~(p & q)') returns ~ main_connective('((p & q) | ~(r | p))') returns | main_connective('~((p & q) | ~r)') returns ~ Complete the function main_connective so that it returns the main connective of any well-formed formula. The key to writing this function is counting brackets in a certain way. You can assume that you always get well-formed and fully bracketed input strings. Do not assume that the input string has regular whitespace. Your function should interpret '(p&q)', '(p &q)', etc. Assume that atomic formula can be of any length but cannot contain spaces or any of the connective symbols or parentheses. Extra credit (up to 2 points): write a function that takes as input a formula and a dict interpreting the propositional letters, and returns the interpretation of that formula. For example pl_interpretation('~(p | q)', {'p': True, 'q': False}) returns False If you do this, you might modify main_connective so that it returns not only the connective but also the connective's argument(s).""" def main_connective(phi): return pl_parse(phi)[0] def pl_parse(phi): #<-- # Regularize by deleting all spaces; safe because atomic formula # cannot contain connectives or brackets. 
phi = phi.replace(' ', '') # Atomic: if not re.search(r"[\(\)~|>&]", phi): return (None, [phi]) # Negation: if phi.startswith('~'): return ('~', [phi[1: ]]) # Binary: bc = 0 for i, c in enumerate(phi): if c == "(": bc += 1 elif c == ")": bc -= 1 if bc == 1 and phi[i+1] in ('&', '|', '>'): # Connective plus its arguments with outer brackts stripped off: return (phi[i+1], [phi[1:i+1], phi[i+2:-1]]) #--> def pl_interpretation(phi, sem): if phi in sem: return sem[phi] else: connective, args = pl_parse(phi) if connective == '~': return not pl_interpretation(args[0], sem) elif connective == '|': return pl_interpretation(args[0], sem) or pl_interpretation(args[1], sem) elif connective == '&': return pl_interpretation(args[0], sem) and pl_interpretation(args[1], sem) elif connective == '>': return (not pl_interpretation(args[0], sem)) or pl_interpretation(args[1], sem) def pl_test(): sem = {'atomic': True, 'p':True, 'q': False, 'r':False} tests = (('atomic', None, True), ('~atomic', '~', False), ('~(p | q)', '~', False), ('(p > q)', '>', False), ('((p & q) > (p | q))', '>', True), ('(~(p & q) | ~(r | p))', '|', False)) err_count = 0 for phi in tests: mc = main_connective(phi[0]) meaning = pl_interpretation(phi[0], sem) if mc != phi[1] or meaning != phi[2]: print "======================================================================" print "Error for %s:" % phi[0] print "Chosen main connective is %s and actual is %s" % (mc, phi[1]) print "Chosen meaning is %s; actual is %s" % (meaning, phi[2]) err_count += 1 print "pl_test error count: %s" % err_count #<-- if __name__ == '__main__': print palindrome_filereader() pl_test() #-->
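# A quick, self-contained sketch of the field-splitting step inside
# supremes_iterator, run on a synthetic line. All field values below
# are invented for illustration; only the " +++$+++ " separator and
# the 8-field order come from the problem description above.
def _demo_field_split():
    """Return (vals, vote) parsed from a synthetic corpus-style line."""
    sep = " +++$+++ "
    fields = ["case_042", "7", "TRUE", "JUSTICE DEMO", "JUSTICE",
              "NA", "PETITIONER", "We will hear argument next."]
    line = sep.join(fields) + "\n"
    vals = line.strip().split(sep)
    # 'NA' becomes the Python object None, as the spec requires:
    vote = vals[5] if vals[5] != 'NA' else None
    return vals, vote
# _demo_field_split() yields 8 fields, with vote == None for the 'NA' case.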
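# A minimal sketch of the counting logic in supremes_wordcounts using
# collections.Counter instead of dict.get, with a synthetic record
# stream standing in for the real corpus file. The speakers and words
# below are made up for illustration only.
from collections import Counter

def _wordcounts_from_records(records):
    """Sum per-justice word counts over an iterable of record dicts."""
    counts = Counter()
    for d in records:
        if d['IS_JUSTICE']:
            counts[d['SPEAKER']] += len(d['WORDS'])
    return dict(counts)

_demo_records = [
    {'SPEAKER': 'JUSTICE DEMO', 'IS_JUSTICE': True,
     'WORDS': ['we', 'will', 'hear', 'argument']},
    {'SPEAKER': 'MR. SMITH', 'IS_JUSTICE': False,
     'WORDS': ['thank', 'you']},
    {'SPEAKER': 'JUSTICE DEMO', 'IS_JUSTICE': True,
     'WORDS': ['counsel']},
]
# _wordcounts_from_records(_demo_records) -> {'JUSTICE DEMO': 5};
# the non-justice speaker is ignored.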
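# A pure-Python cross-check of the G-statistic formula used in gtest,
# written with math only so it can be verified by hand without numpy:
# G = 2 * sum_ij O_ij * ln(O_ij / E_ij), where E_ij = rowsum_i *
# colsum_j / N. If scipy is installed, the result should agree with
# scipy.stats.chi2_contingency(observed, correction=False,
# lambda_="log-likelihood"). The 2x2 table below is made up.
import math

def _g_statistic(observed):
    """G-statistic for a contingency table given as a list of lists."""
    rsums = [sum(row) for row in observed]
    csums = [sum(col) for col in zip(*observed)]
    total = float(sum(rsums))
    g = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = rsums[i] * csums[j] / total
            g += o * math.log(o / e)
    return 2.0 * g
# _g_statistic([[10, 20], [30, 40]]) -> about 0.8043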
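# A small standalone illustration of the bracket-counting idea behind
# pl_parse: in a fully bracketed formula, the main binary connective
# is the one that occurs at nesting depth 1, i.e. inside only the
# outermost pair of brackets. This sketch handles only the binary
# case; pl_parse peels off negation and atomic formulae first.
def _main_binary_connective(phi):
    """Return the depth-1 binary connective of phi, or None."""
    phi = phi.replace(' ', '')
    depth = 0
    for c in phi:
        if c == '(':
            depth += 1
        elif c == ')':
            depth -= 1
        elif depth == 1 and c in '&|>':
            return c
    return None
# _main_binary_connective('((p & q) | ~(r | p))') -> '|' : the '&'
# and inner '|' sit at depth 2, so only the outer '|' is reported.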