Processing text files with Microsoft Text Driver

While ODBC is API for accessing databases Microsoft Text ODBC driver preinstalled in Windows can be used for processing text files (txt and csv). That means SQL queries like SELECT can be executed to txt and csv files!

Using SELECT queries makes some advanced features all possible from command line without scripting!

  • processing data as Numbers, Dates, Strings and other supported data types
  • sorting rows with ORDER BY clause and grouping with GROUP BY clause
  • selecting rows with conditions and LIKE clause
  • finding minimum, maximum, and average values
  • finding distinct values with SELECT DISTINCT query

In this article:

Microsoft Text Driver connection string

Driver={Microsoft Text driver (*.txt; *.csv)};DBQ=C:\ProgramData\My Program\Data\;DriverId=27;MaxBufferSize=2048;PageTimeout=5

DBQ parameter

This is the only mandatory parameter telling the driver what folder to use. It may contain spaces requiring no quots:

DBQ=C:\ProgramData\My Program\Data\;

One can tell the driver to use current folder with dot character (.) as path value:

DBQ=.

Command line for nhdbt program:

Connect to files in current folder and run all SQL queries from test.sql:

nhdbt "Microsoft Text driver (*.txt; *.csv)" "dbq=." test.sql

Microsoft Text Driver Schema.ini file

A Schema.ini file is required to describe data format in txt or csv files.

Schema.ini File (Text File Driver)

When the Text driver is used, the format of the text file is determined by using a schema information file. The schema information file is always named Schema.ini and always kept in the same directory as the text data source. The schema information file provides the IISAM with information about the general format of the file, the column name and data type information, and several other data characteristics. A Schema.ini file is always required for accessing fixed-length data. You should use a Schema.ini file when your text table contains DateTime, Currency, or Decimal data, or any time that you want more control over the handling of the data in the table.

Example of Schema.ini file for apache_access.csv data file:

[apache_access.csv]
Format=TabDelimited
ColNameHeader=false
CharacterSet=ANSI
Col1=IPAddress char width 15
Col2=AccessTime date
Col3=Request char width 250
Col4=Code Integer

As soon as Schema.ini file is saved in the folder the driver is able to accept connection.

DateTime format

Date and time formats is not explained on Schema.ini File (Text File Driver) so it can be a tricky thing.

By default the driver applies user’s short format.

For example default setting for English US will be dd/MM/yyyy HH:mm, but dd/MM/yyyy HH:mm:ss will also be accepted. Thus processing following log file with default English US setting will work while same attempt with European regional setting will fail because of date/time format mismatch:

70.69.152.165	02/22/2017 22:13:23	GET /blog/index.php HTTP/1.1...
128.196.108.201	02/22/2017 23:21:23	GET /favicon.ico HTTP/1.1...
147.106.118.104	02/22/2017 15:27:00	GET /company/ HTTP/1.1...

Date and time format can be changed with DateTimeFormat parameter. The custom format is applied to all Date columns in this file:

[apache_access.csv]
...
DateTimeFormat=mm/dd/yyyy hh:nn:ss
...

It is important that formatting rules don’t follow standards used in Windows:

  • nn has to be used instead of mm for minutes
  • AM/PM not supported so it is case insensitive
US format DateTimeFormat=mm/dd/yyyy hh:nn:ss
European format DateTimeFormat=dd/mm/yyyy hh:nn:ss
SQL format DateTimeFormat=yyyy-mm-dd hh:nn:ss

When data format in the file does not match expected date/time format NULL value will be used.

Examples

Below are some examples of executing SQL queries to a Tab delimited file:

70.69.152.165	2017-02-22 19:39:27	GET /_media/logo_my_v2.png HTTP/1.1	304...
147.106.118.104	2017-02-22 18:00:37	GET /_css/master.1334356838.css HTTP/1.1	200...
65.98.119.36	2017-02-22 16:32:25	GET /_js/master.js HTTP/1.1	200...

Schema.ini file is as follows (not all columns have to be declared):

[apache_access.csv]
Format=TabDelimited
ColNameHeader=false
CharacterSet=ANSI
DateTimeFormat=yyyy-mm-dd hh:nn:ss
Col1=IPAddress char width 15
Col2=AccessTime date
Col3=Request char width 250
Col4=Code Integer

nhdbt program used for testing. All the files available to download at the end of this page.

Retrieve count of rows

SELECT COUNT allows easily find number of all or specific lines in the file.

How to count all rows in the file:

select count(*) from apache_access.csv
   Expr1000 
----------- 
       9946 

How to count rows by Code:

select Code,count(Code) from apache_access.csv group by Code order by Code
       Code    Expr1001 
----------- ----------- 
        200        7059 
        302         407 
        304        1525 
        401         224 
        403           5 
        404         527 
        500          75 
        503         124

Finding aggregated values

How to find minimum and maximum values:

select min(AccessTime),max(AccessTime) from apache_access.csv
Expr1000            Expr1001            
------------------- ------------------- 
2017-02-22 15:00:22 2017-02-22 23:33:31

Of course AVG and SUM also supported for Integer values.

Retrieve distinct values

How to list only distinct IP addresses sorted alphabetically:

select distinct(IPAddress) from apache_access.csv order by IPAddress asc
IPAddress       
--------------- 
101.92.120.16   
128.196.108.201 
14.47.165.89    
147.106.118.104 
161.71.8.142    
19.174.45.8     
221.125.19.252  
30.75.225.192   
34.87.4.6       
46.139.54.184   
5.35.225.115    
52.87.131.109   
65.98.119.36    
70.69.152.165   
78.235.33.64    

Searching for text values

How to list users on Windows:

select distinct(UserAgent) from apache_access.csv where UserAgent like '%Windows%'
UserAgent                                                                                                                                                                                                                                                       
------------------------------------------------------------------------------------------------------------
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:23.0) Gecko/20131011 Firefox/23.0                               
Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1467.0 Safari/537.36        
Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; Trident/4.0; InfoPath.2; SV1; .NET CLR 2.0.50727; WOW64)

How to list Opera user agents:

select distinct(UserAgent) from apache_access.csv where UserAgent like '%Opera%'
UserAgent                                                                                                                                                                                                                                                       
-------------------------------------------------------------------------------------
Opera/9.80 (iPad; Opera Mini/7.1.32694/27.1407; U; en) Presto/2.8.119 Version/11.10
Opera/9.80 (iPhone; Opera Mini/7.1.32694/27.1407; U; en) Presto/2.8.119 Version/11.10

Limitations

A delimited text line contains one or more data values separated by delimiters: commas, tabs, or a custom delimiter. The same delimiter must be used throughout the file.
Tables are limited to a maximum of 255 fields. Field names are limited to 64 characters, and field widths are limited to 32,766 characters. Records are limited to 65,000 bytes.

A text file can be opened only for a single user. Multiple users are not supported.

Downloads

All files include sample log file, Schema.ini, SQL file, test.cmd, and nhdbt.exe are available for downloading from Google Drive:
processing-text-files-with-microsoft-text-driver.zip

See also

More information on Microsoft Text driver can be found at Microsoft:
Schema.ini File (Text File Driver)
Text File Format (Text File Driver)

nhdbt is freeware program for Windows to execute SQL queries from command line.


Comments

One response to “Processing text files with Microsoft Text Driver”

  1. hyper kay Avatar
    hyper kay

    “nn has to be used instead of mm for minutes”

    you saved my sanity!
    i’m not joking. I am not a crazy person now because of you.
    thank you so much.

Leave a Reply

Your email address will not be published. Required fields are marked *