Sha256 Hash Dictionary attack Assignment with Python

Assignment #1

Requirements
1. At least 8 characters long
2. Contains both upper and lower-case letters
3. Contains one or more numerical digits and special characters.
4. Must be written in C/C++/Java/Python
5. Time limit for key/100 seconds
6. No salt keys. -> http://www.xorbin.com/tools/sha256-hash-calculator
7. All Program is smaller than 30MB

블로그 글을 너무 대충 작성해서 다시 쓰기로 했다. 구글에 검색하니 나오는게 신기하기도 했고..

먼저 이 과제의 목적은, 일반적으로 알려진 비밀번호를 사용하면 안된다는 것이다.
대부분의 비밀번호로 사용되는 hash 값들은 salt를 이용하여 이중 비밀번호를 진행한다.
따라서 one way 암호화를 진행하는데, 이 과정을 거치지 않을 경우에 비밀번호가 저장된 데이터베이스가 털렸을시 비밀번호 guessing이 가능하다는것을 알려주고자 하는 과정이다

요즘은 비밀번호 자릿수를 무조건 8자리이상에, 특문 하나를 넣어야 된다지만,
예전엔 그렇지 않았다는것을 누구나 안다. 6자리이상에 숫자 안넣어도, 아니, 숫자만 넣어도 비밀번호를 넣을 수 있었다.
이런 취약점들덕에 비밀번호 제도가 바뀌게 되었는데... 사설이 너무 길었고

어쨌든 이 과제에서는 SHA256으로 암호화된 비밀번호를 역으로 알아내야 하는것인데, 공격법은 세가지 정도로 나뉜다.

1. 역상 공격(preimage Attack)
Secure한 SHA256상대로 할 수 없다. 할 수 있을진 모르겠는데 일단 내 실력으론 어림도 없는건 확실했다.

2. 무차별 대입 공격(Brute force Attack)
SHA256에 할당된 8자리의 레인보우 테이블만 4테라바이트라고 한다....

3. 사전 공격(Dictionary Attack)
결국 세가지 방법중에 미리 정의한 잘 알려진 단어와 유출된 사용자들이 자주 쓰던 비밀번호를 사용해서 사전을 만들고, 그를 SHA256 hash로 다시 정의하여 찾는 방법이 합당해보였다

우선 Password Dictionary라고 검색하면 어렵지 않게 정의된 사전들을 찾을 수 있다.
처음에 접근해보길, 모든 단어를 커버할 수 없으니 8자리만을 선택하자. 그런데 대소문자가 반드시 필요되시되고 특문 하나 숫자 하나도 들어가니까, 임의의 글자를 만들어 넣자라고 생각했으나, 프로그램 크기가 30메가 제한이라는 소식을 접하게 되었다.

처음엔 6자리 단어에 임의로 특문도 넣고 그랬는데 그랬더니 사전의 크기가 너무 커지게 되어서 비효율적이게 되었다.
그리고 만들어놓고 보니 Pass1[#2 뭐 이런꼴인데 사실 이렇게 비번 쓰는 사람도 없을 것 같았다..
그래서 딕셔너리를 다음과 같은 조건으로 거르고 해싱을 해보았다.

조건은,
1. 8~20자리의 글자
2. 숫자가 포함되어 있을것
3. 대문자가 포함되어 있을 것

i = 1
with open("file.txt") as f:
 for line in f:
     raw_pw = line.split(":")[0].replace('\r\n','').replace('\n','')
     if len(raw_pw) < 20 and len(raw_pw) >= 8:
          if re.search("[\!\-\~]",raw_pw) > -1:
               if re.search("[0-9]",raw_pw) > -1: 
                  if re.search("[A-Z]",raw_pw) > -1: 
                     hash_text = hashlib.sha256(raw_pw).hexdigest()
                     fp = open("aa.out","a")
                     fp.write(str(i)+":"+raw_pw+":"+hash_text+"\r\n")
                     fp.close()
                     i += 1

더 깔끔하게 코딩할 수 있을것 같은데 일단 이런식으로 해봤다..
이렇게 하면 백몇십메가나 되던 사전의 크기가 엄청나게 압축된다.
실제로 백만줄이 넘어가던 사전이 1메가도 안된다 지금은..
크기는 30메가 제한인데 제출과 현재 블로그 글을 쓰면서도 불안하나 그냥 하늘에 맡기기로 했다..
그 결과는 이곳(Github)에 있다.

나는 사실 여기까지 만들면서 사전만 잘 구성하면 될거라고 생각 했다. 사실 그게 핵심이고..
하지만 문제가 있었는데, 제출당일날 속도 제한이 있다는것을 깨달았다.
테스트로 돌려본 코드는 I/O를 이용해서 1개 키를 매칭해서 찾고 매칭해서 찾고 하는것이었다.
이건 망했으니 github엔 올리지 않고 여기에서 다룬다..

 import timeit
start = timeit.default_timer()

i = 1
with open("hashedPasswords.txt") as target:
      for line in target:
          user = line.split(":")[1].replace("\r\n","")
          target_word = line.split(":")[2].replace("\r\n","")
          with open("out_a.out") as dic:
              for dic_word in dic:
                  if target_word == dic_word.split("|")[1].replace("\r\n",""):
                      findit = dic_word.replace("\r\n","")
                      fp = open("Passwords.txt","a")
                      fp.write(str(i)+":"+user+":"+findit.split("|")[0]+"\r\n")
                      fp.close()
                      i+=1
                      break
                  
stop = timeit.default_timer()
print "Program Run Time : " + str(stop - start)

굳이 이름을 붙여보자면, Find word in large text file using python... zzz...
기가 막히지만, 말그대로다. 파일 포인터를 두개 열고 일대일로 나올때까지 매칭된다.
아마 사전의 크기가 크고 시도되는 키가 많다면 밤새도록 돌려야될지도 모른다.
내가 쓰려고 짰던건 아니고... 뭐 그렇다.

나는 이거랑 다르게 쓰레드를 이용해보기로 했다.
Python은 싱글코어만 사용하기때문에 쓰레드가 좋지 않아 프로세스를 늘리는 방식으로 사용해야 된다고 본적이 있다.
(참고: 파이썬으로 클라우드 하고 싶어요)

그래서 애초에 프로세스를 한개 더 늘려서 만들어보기로 했다.
사실 syncronized variable을 어떻게 만들어야될지 감도 안와서 추잡하게 다 쓰고 다시 읽어서 넘버링하고, 임시파일을 지우는 꽁수를 적어놨는데, 나는 코드의 신에게 분명히 맞아죽을것이다...
어쨌든 시도 자체는 훌륭했다고 생각한다. 프로세스를 물려서 뭔가 처릴 해보려고 했으니....... 남는건 있었겠지.
이렇게 돌린 코드는... 600초가 소모됐다 -_-

#!/usr/bin/env python

# Author     : Myeong-Uk (mwsong@imtl.skku.ac.kr)
# Date       : 2015. 03. 30
# Desc       : Python program to dictionary attack SHA256
# Dependency : out_c.out (Password dictionary), hashedPasswords.txt(input)
# Command    : <python find.py>
# OUTPUT     : Passwords.txt

import timeit,os
from multiprocessing import Process,Lock
  
  
start_time = timeit.default_timer() #when program start

def do_work(type):
    hash_count = 0
    with open('aa.out') as target:
         for line in target:
            findit = ""
            if type == "1" and hash_count%2 == 0:
                findit = search_word(line)
            elif type== "2" and hash_count%2 > 0:
                findit = search_word(line)
            hash_count +=1

            if findit != "" and findit != None:
                 user = findit.split("|")[0]
                 crackedPassword = findit.split("|")[1]

                 fp = open("Passwords_tmp.txt","a")
                 fp.write(user+":"+crackedPassword+"\r\n")
                 fp.close()

  
def search_word(word):
    user = str(word.split(":")[1].replace("\r\n",""))
    target_hash = word.split(":")[2].replace("\r\n","")

    with open("out_c.out") as dic:
        for dic_line in dic:

            dictionary_hash = dic_line.split("|")[1].replace("\r\n","")
            dictionary_word = dic_line.split("|")[0].replace("\r\n","")

            if target_hash == dictionary_hash:
                 return user + "|" + dictionary_word
                 break
 
def numbering():
    i = 0
    with open("Passwords_tmp.txt") as dic:
        for find_word in dic:    
             fp = open("Passwords.txt","a")
             fp.write(str(i)+":"+find_word)
             fp.close()
             i += 1
    print str(i) + " Password(s) found."

 
     
if __name__ == '__main__':

    pr1 = Process(target=do_work,args=(str(1)))
    pr2 = Process(target=do_work,args=(str(2)))
      
    pr1.start()
    pr2.start()
      
    pr1.join()
    pr2.join()


    stop_time = timeit.default_timer() #when program end

    print "Dictionary matching done"

    print "Numbering.."
    numbering()
    os.remove("Passwords_tmp.txt")
    print "Done!!"

    print "Program Run Time : " + str(stop_time - start_time)

사실 이걸 다 짜고 낼려는데 그제서야 시간제한이 있다는걸 알게 됐다
이대로 내면 100초에 끊을 경우 넘버링으로 다시 쓰는것도 안되기때문에 실제로 다 찾더라도 0이 될께 뻔했다
고로 이 코드는 못쓰는 코드라고 생각하고 다시 생각을 하기 시작했다.

다시 생각을 해보니까, 이런 생각이 들었다. 얘네를 왜 굳이 I/O를 이용해야 하는가?에 대한..
애초에 작은 사전을 만들었으니까, 모두 string 화 해서 사용하면 안될까? 생각이 들었다.
이때 stack overflow에서 찾은 이 글에서 영감을 많이 받았다.
요는, 파이썬은 루프는 느리다. 램에 다 올려서 string으로 찾아라. 그건 굉장히 빠르다.
그래서 덤프를 뜨고 메모리에서 비교를 해보기로 했다.

import timeit,re
from multiprocessing import Process
  
  
start_time = timeit.default_timer() #when program start

pass_dump = ""
hash_dump = ""


def init():
    global pass_dump, hash_dump
    f = open("out_c.out","r")
    pass_dump = f.read()

    f = open("aa.out")
    hash_dump = f.read()



def do_work():
   lines = hash_dump.split("\r\n")
   for i in lines:
    search_word(i.split(":")[2])
  
def search_word(word):
    match = re.search(r'.*'+word,pass_dump)
    if match:
        print match.group(0)
     
if __name__ == '__main__':

    init()

    do_work()

    stop_time = timeit.default_timer() #when program end

    print "Dictionary matching done"

    print "Numbering.."
    #numbering()
    #os.remove("Passwords_tmp.txt")
    print "Done!!"

    print "Program Run Time : " + str(stop_time - start_time)

요점은 한번에 싹 읽고, 정규식을 이용해 해당 라인을 다 가져와서 처리해보겠다고 한것이다.
정규식쓰면 당연히 빠르겠거니 하고 생각했는데, 어찌됐던 이전보단 빨라보이긴했는데 아니 끝이 안난다.
이 코드 검증은 안해서 잘 모르겠지만 그냥 망했다는 삘이 왔다. 너무 느렸다..

그래서 먼저 한 후배한테 물어보니까 그냥 딕셔너리 기본 옵션으로 구현했다고 한다.
무슨차이가 있을까.. 정규식보다 더 빠를 수 있을까? 속는 심정으로 코드를 짰다.

import timeit

start_time = timeit.default_timer() #when program start

pass_dump = {}
hash_dump = {}
count = 0

def init():
    global pass_dump, hash_dump
    with open("dictionary.out","r") as dic:
       for a in dic:
        key = a.split("|")[1].replace("\r\n","").replace("\n","") #hash
        value = a.split("|")[0] #original text
        pass_dump[key] = value

    with open("hashedPasswords.out","r") as dic:
       for a in dic:
        key = str(a.split(":")[2].replace("\r\n","").replace("\n","")) #hash
        value = a.split(":")[1] #user
        hash_dump[key] = value

    print "Init finished.."

def do_work():
    global hash_dump,pass_dump,count

    for a in hash_dump:
        try:
            if pass_dump[a]:
                count += 1
                fp = open("Passwords.txt","a")
                fp.write(str(count)+":"+hash_dump[a]+":"+pass_dump[a]+"\r\n")
                fp.close()
        except KeyError:
            pass
  
def search_word(word):
    match = re.search(r'.*'+word,pass_dump)
    if match:
        print match.group(0)
     
if __name__ == '__main__':

    init()
    do_work()

    stop_time = timeit.default_timer() #when program end

    print "Done!!"
    print str(count) + " Password(s) found. Recorded at Passwords.txt"
    print "Program Run Time : " + str(stop_time - start_time)

겉치레 이런거 없이, 이 코드는 그냥 엔터치면 검색이 끝났다...
배열처럼 사용하는 딕셔너리의 키값에 해쉬를 넣고 반환되는 값이 있나 없나 비교해보는것으로 모든 면에서 우수한것을 확인했다.
원리는 잘 모르겠다. 도대체 나는 동기화 이딴건 왜 찾아보고 있었나 이런 의문이 그냥 들었을뿐....
그냥 파이썬이 딕셔너리 키를 매칭해 찾는 속도가 개빠르구나.. 생각만 들었다. 혹시 구성할일 있으면 딕셔너리가 짱이니 잘쓰도록..

어쨌든, 쉽게 접근하면 쉽게 끝낼 수 있었던 과제였는데 너무 꼬아서 생각하는 바람에,
시간은 시간대로 잡아먹고 결과는... 안나왔지만 뻔할거라 생각한다만..

그래도 간만에 재밌는 과제였다 생각한다. 이런 짱구를 굴리는 과제가 재밌고 좋다.
이 개삽질이 누군가에게 도움되길 바라며 github 링크를 조심스럽게 링킹한다..

Reference
https://sites.google.com/site/reusablesec/Home/password-cracking-tools/probablistic_cracker
http://ko.wikipedia.org/wiki/%EC%97%AD%EC%83%81_%EA%B3%B5%EA%B2%A9
http://www.laurentluce.com/posts/python-and-cryptography-with-pycrypto/
http://www.xorbin.com/tools/sha256-hash-calculator
https://wiki.skullsecurity.org/Passwords
http://www.thegeekstuff.com/2014/07/advanced-python-regex/
http://stackoverflow.com/questions/8023306/get-key-by-value-in-dictionary
http://stackoverflow.com/questions/3449384/fastest-text-search-method-in-a-large-text-file

늘모자란, 개발

Sha256 Hash Dictionary attack Assignment with Python