How to filter only printable characters in a file on Bash (Linux) or Python?
I want to turn a file that contains non-printable characters into one that contains only printable characters. I think the problem is related to ASCII control characters, but I could not find a solution, and I do not understand the meaning of `.[16D` (an ASCII control character?) in the following file.
Hexdump of the input file:

    00000000: 4845 4c4c 4f20 5448 4953 2049 5320 5448  HELLO THIS IS TH
    00000010: 4520 5445 5354 1b5b 3136 4420 2020 2020  E TEST.[16D     
    00000020: 2020 2020 2020 2020 2020 201b 5b31 3644             .[16D
    00000030: 2020
When I `cat` the file in Bash, I get just "HELLO ". I think this is because `cat` passes the two `.[16D` sequences to the terminal, which interprets them as control sequences.

Why do the two `.[16D` strings make `cat` print just "HELLO"? And how can I make the file contain only printable characters, i.e., "HELLO "?
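The hexdump shows that the dot in `.[16D` is actually an escape character, `\x1b`. `Esc[nD` is the ANSI escape code that moves the cursor back (left) by `n` columns. It does not delete anything by itself, but any text written afterwards overwrites what was there. So `Esc[16D` tells the terminal to move back 16 columns, and the 16 spaces that follow overwrite "THIS IS THE TEST", which explains the `cat` output.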
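To make the overwrite effect concrete, here is a minimal sketch that replays the bytes from the question on a single line. `render_line` is a hypothetical helper written for this illustration, not part of any library, and it assumes the only control sequence present is `Esc[nD`. Everything else is treated as an ordinary character.

```python
import re

# Either one "cursor back" sequence (ESC [ n D) or any single character.
TOKEN = re.compile(r"\x1b\[(\d*)D|(.)", re.DOTALL)

def render_line(data):
    """Render text on one line, honouring only ESC[nD (cursor back)."""
    cells = []   # characters currently shown on the line
    cursor = 0   # column where the next character will be written
    for match in TOKEN.finditer(data):
        count, char = match.groups()
        if char is None:
            # ESC[nD: move left n columns (default 1), stopping at column 0.
            cursor = max(0, cursor - int(count or 1))
        else:
            # Ordinary character: overwrite the cell under the cursor,
            # or extend the line if the cursor is at its end.
            if cursor < len(cells):
                cells[cursor] = char
            else:
                cells.append(char)
            cursor += 1
    return "".join(cells)

# The 50 bytes from the hexdump in the question.
data = "HELLO THIS IS THE TEST\x1b[16D" + " " * 16 + "\x1b[16D" + "  "

rendered = render_line(data)
print(repr(rendered.rstrip()))  # 'HELLO'
```

The first `Esc[16D` moves the cursor from column 22 back to column 6, the 16 spaces blank out "THIS IS THE TEST", and the second `Esc[16D` plus two spaces rewrites blanks over blanks.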
There are various ways to remove ANSI escape codes from a file, either using Bash commands (e.g. using `sed`, as in anubhava's answer) or using Python.
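As a rough Python illustration of the stripping approach, the sketch below removes CSI-style escape sequences with a regular expression. The pattern and the helper name `strip_csi` are assumptions made for this example. It is not the `sed` command from the other answer, and it only covers sequences of the form `Esc[ ... final-byte`.

```python
import re

# CSI sequence: ESC [, then parameter bytes (0x30-0x3F),
# intermediate bytes (0x20-0x2F), and one final byte (0x40-0x7E).
CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

def strip_csi(text):
    """Delete CSI escape sequences without interpreting them."""
    return CSI.sub("", text)

# The 50 bytes from the hexdump in the question.
data = "HELLO THIS IS THE TEST\x1b[16D" + " " * 16 + "\x1b[16D" + "  "

stripped = strip_csi(data)
print(repr(stripped.rstrip()))  # 'HELLO THIS IS THE TEST'
```

Note that stripping keeps the text that the terminal would have overwritten, so the result is "HELLO THIS IS THE TEST" followed by spaces, not the "HELLO" that `cat` displays.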
However, in cases like this, it may be better to run the file through a terminal emulator that interprets any editing control sequences in the file, so you get the result the file's author intended after those editing sequences have been applied.
One way to do that in Python is to use pyte, a Python module that implements a simple VTxxx-compatible terminal emulator. You can install it using `pip`, and its docs are on readthedocs.
Here's a simple demo program that interprets the data given in the question. It's written for Python 2, but it's easy to adapt to Python 3. `pyte` is Unicode-aware, and its standard Stream class expects Unicode strings, but this example uses a ByteStream, so I can pass it a plain byte string.
    #!/usr/bin/env python

    ''' pyte VTxxx terminal emulator demo

        Interpret a byte string containing text and ANSI / VTxxx control sequences

        Code adapted from the demo script in the pyte tutorial at
        http://pyte.readthedocs.org/en/latest/tutorial.html#tutorial

        Posted to http://stackoverflow.com/a/30571342/4014959

        Written by PM 2Ring 2015.06.02
    '''

    import pyte

    #hex dump of data
    #00000000  48 45 4c 4c 4f 20 54 48  49 53 20 49 53 20 54 48  |HELLO THIS IS TH|
    #00000010  45 20 54 45 53 54 1b 5b  31 36 44 20 20 20 20 20  |E TEST.[16D     |
    #00000020  20 20 20 20 20 20 20 20  20 20 20 1b 5b 31 36 44  |           .[16D|
    #00000030  20 20                                             |  |

    data = 'HELLO THIS IS THE TEST\x1b[16D                \x1b[16D  '

    #Create a default sized screen that tracks changed lines
    screen = pyte.DiffScreen(80, 24)
    screen.dirty.clear()
    stream = pyte.ByteStream()
    stream.attach(screen)
    stream.feed(data)

    #Get index of last line containing text
    last = max(screen.dirty)

    #Gather lines, stripping trailing whitespace
    lines = [screen.display[i].rstrip() for i in range(last + 1)]

    print '\n'.join(lines)
Output

    HELLO

Hex dump of output

    00000000  48 45 4c 4c 4f 0a                                 |HELLO.|