Page 1 of 2
Bash: Count numbers in list
Posted: Wed 22 Aug 2012, 14:55
by zigbert
Hello!
I am working on a new rating system in pMusic. I want to see how many times the track is played last week or last month.
The idea is to add 'date +%s' to the rating-index each time the track has played. The index file could look like this
Code: Select all
/root/file1.mp3| 1345641102,1345644563,1345647584,...
/root/file2.mp3| 1345623045,1345643786,1345648695,...
Now the question is:
How many times has file2 played since 1345643390?
In other words:
How many numbers in line 2 is above 1345643390?
The list will be long, so we need a fast calculation
Thank you
Sigmund
Posted: Wed 22 Aug 2012, 15:20
by Flash
So those numbers are time stamps that show each time the file was played? Are they in chronological order?
Couldn't you just find the number of commas in each line? (The reverse of counting sheep by counting their legs and dividing by four.
)
Re: Bash: Count numbers in list
Posted: Wed 22 Aug 2012, 16:24
by L18L
Code: Select all
#!/bin/ash
#
log="/root/file2.mp3| 1345623045,1345643786,1345648695"
#
#How many numbers in log is below 1345643390?
#How many numbers in log is below oneTIME
oneTIME=1345643390 # seconds since 1970-1-1
COUNT=0
for VAR in `echo $log | cut -d'|' -f2 | tr ',' '\n'`; do
[ $(( $VAR - $oneTIME )) -gt 0 ] && COUNT=$(( $COUNT + 1 ))
done
echo $COUNT numbers are above $oneTIME
The following should be faster if in chronological order:
Code: Select all
#!/bin/ash
#
log="/root/file2.mp3| 1345623045,1345643786,1345648695"
#
#How many numbers in log is above 1345643390?
#How many numbers in log is above oneTIME
oneTIME=1345643390 # seconds since 1970-1-1
COUNT=0
for VAR in `echo $log | cut -d'|' -f2 | tr ',' '\n'`; do
[ $(( $VAR - $oneTIME )) -lt 0 ] && COUNT=$(( $COUNT + 1 )) || break
done
echo $COUNT numbers are below $oneTIME
You have to subtract this from the total number of commas.
Posted: Wed 22 Aug 2012, 19:46
by seaside
zigbert,
If you don't care about dates, you might just do:
Code: Select all
grep <file name> music_play_log | wc -l
Regards,
s
Posted: Wed 22 Aug 2012, 20:54
by zigbert
Thank you guys for your feedback
Yes, the numbers will be in in chronological order.
@seaside
The rating system today shows how many times you have played the track - with no dates. I want more.....
@L18L
This is sure a solution, but it will be slow when rating-index grows.
Sigmund
Posted: Wed 22 Aug 2012, 21:04
by SFR
zigbert wrote:[...]it will be slow when rating-index grows.
How about binary search algorithm?
Although it looks pretty complex to implement (at least for me) when using such a type of index (item1, item2, item3...), but it should be easier with arrays.
Here's the snippet I found:
http://www.bashguru.com/2011/01/binarys ... cript.html
Greetings!
Count numbers in list
Posted: Thu 23 Aug 2012, 13:02
by L18L
SFR wrote:How about binary search algorithm?
Although it looks pretty complex to implement (at least for me) when using such a type of index (item1, item2, item3...), but it should be easier with arrays.
Here's the snippet I found:
http://www.bashguru.com/2011/01/binarys ... cript.html
Here is what I made from that snippet:
Code: Select all
#!/bin/sh
# SCRIPT : derived from: binarysearch.sh
# SCRIPT : bsearch.sh
# USAGE : bsearch.sh
# PURPOSE: Searches given number in a sorted list.
# \\\\ ////
# \\ - - //
# @ @
# ---oOOo-( )-oOOo---
#
#####################################################################
# Define Functions Here #
#####################################################################
binarysearch()
{
status=-1
i=1
array=($(echo "$@"))
LowIndex=0
HeighIndex=$((${#array[@]}-1))
while [ $LowIndex -lt $HeighIndex ]
do
MidIndex=$(($LowIndex+($HeighIndex-$LowIndex)/2))
MidElement=${array[$MidIndex]}
case $(($MidElement - $SearchedItem)) in
-*) LowIndex=$(($MidIndex+1)) ;;
0) status=0
searches=$i
return ;;
*) HeighIndex=$(($MidIndex-1)) ;;
esac
let i++
done
}
#####################################################################
# Variable Declaration #
#####################################################################
clear
echo "Enter Sorted Array Elements : "
read -a ARRAY
count=${#ARRAY[@]}
#####################################################################
# Main Script Starts Here #
#####################################################################
while [ : ]
do
echo -n "
Type in a number or break using CTRL-C: "
read SearchedItem
binarysearch "${ARRAY[@]}"
[ $SearchedItem -gt $MidElement ] \
&& echo "nothing found equal or later than $SearchedItem" \
|| echo "$SearchedItem found after $searches searches, MidIndex=$MidIndex MidElement=$MidElement"
done
Binary search application
Posted: Thu 23 Aug 2012, 15:50
by akash_rawal
And here's it's application in zigbert's problem:
Code: Select all
#!/bin/bash
#usage: played_since filename time
#data stored in database.txt
file="$1";
time="$2";
ifs_bak="$IFS"
IFS=""
while read -d "|" record; do
#check for file
if test "$record" = "$file"; then
IFS=","
#Get all timestamps in array
read -a times
#Binary search
i=$(( ${#times[*]}/2 ))
d=$(( $i/2 ))
test "$d" -eq "0" && d=1 #precaution
ulim=$(( ${#times[*]}-1 ))
while test "$i" -ge "0" -a "$i" -lt "$ulim"; do #check whether index is in range
if test "$time" -lt "${times[$i]}"; then
i=$(( $i-$d ))
elif test "$time" -gt "${times[$(( $i+1 ))]}"; then
i=$(( $i+$d ))
else
#found, tell no. of records greater than this
echo "$(( $ulim-$i ))"
exit
fi
d=$(( ($d/2) ))
test "$d" -eq "0" && d=1 #precaution
done
#In case of 'out of range'
if test "$ulim" -lt "0"; then
echo "0"
else
if test "$time" -lt "$times"; then
echo "${#times[*]}"
elif test "$time" -gt "${times[$ulim]}"; then
echo "0"
else
echo "bug" #This should not be reached
fi
fi
exit
fi
read trash
done < database.txt
IFS="$ifs_bak"
echo "File not found"
I have found that overhead of reading data from file into array is much larger than time required for binary search, we have to do something there.
Plus, if the lists are going to be really large it would be performance advantage to store timestamps of each track in a separate file.
I am using a c++ program to generate about 87 MB test file to search in.
Code: Select all
//g++ -Wall -o data_gen source.cpp
#include <iostream>
#include <fstream>
#include <cstdlib>
using namespace std;
int main()
{
ofstream strm("database.txt", ios::out);
int i, j;
for (i = 0; i < 100; i++)
{
unsigned long rand_time = 0;
strm << "file" << i << ".mp3|";
for (j = 0; j < 100000; j++)
{
rand_time += (rand() % 1000);
strm << rand_time << ",";
}
rand_time += (rand() % 1000);
strm << rand_time << "\n";
}
strm.close();
return 0;
}
Time taken for last file roughly equals time to generate the database.
Code: Select all
# time ./data_gen
real 0m6.873s
user 0m3.936s
sys 0m0.257s
# time ./played_since file0.mp3 48000000
3881
real 0m0.269s
user 0m0.173s
sys 0m0.027s
# time ./played_since file99.mp3 48000000
3672
real 0m6.849s
user 0m5.696s
sys 0m0.677s
#
Posted: Thu 23 Aug 2012, 16:59
by zigbert
Thank you guys
Pretty amazing you want to spend so much time helping out!!!!!
I made a test.
1.) Checking my existing rating index - there are about 2000 tracks in the index.
2.) Make a fake index based on the new structure including 2000 tracks and 25 timestamps for each track. All together 574kb.
3.) Run the played_since script to get an assumption of how fast it can build an updated list of favorite tracks.
Code: Select all
# cut -d'|' -f1 ./database.txt > /tmp/played_since; time while read I; do ./played_since.sh $I 1345643390; done < /tmp/played_since
...
...
...
24
24
24
24
24
24
24
24
24
24
real 0m30.799s
user 0m21.339s
sys 0m5.560s
#
I have a modern system, and still it takes half a minute to build a rating list. Is this an understanding why ratings in the big music-apps uses stars for rating instead of a exact status.
But still, it would be awesome to define a given time (last week, januar 2011, or whatever) and then get My favorite tracks in that period.
Sigmund
Posted: Thu 23 Aug 2012, 17:55
by akash_rawal
If you want to process all files then why not go this way:
Code: Select all
#!/bin/bash
#usage: played_since_foreach time
time="$1";
ifs_bak="$IFS"
IFS=""
while read -d "|" record; do
IFS=","
#Get all timestamps in array
read -a times
#Binary search
i=$(( ${#times[*]}/2 ))
d=$(( $i/2 ))
test "$d" -eq "0" && d=1 #precaution
ulim=$(( ${#times[*]}-1 ))
count=-1
while test "$i" -ge "0" -a "$i" -lt "$ulim"; do #check whether index is in range
if test "$time" -lt "${times[$i]}"; then
i=$(( $i-$d ))
elif test "$time" -gt "${times[$(( $i+1 ))]}"; then
i=$(( $i+$d ))
else
#found, tell no. of records greater than this
count="$(( $ulim-$i ))"
break
fi
d=$(( ($d/2) ))
test "$d" -eq "0" && d=1 #precaution
done
#In case of 'out of range'
if test "$count" -lt "0"; then
if test "$ulim" -lt "0"; then
count=0
else
if test "$time" -lt "$times"; then
count="${#times[*]}"
elif test "$time" -gt "${times[$ulim]}"; then
count=0
else
count="bug" #This should not be reached
fi
fi
fi
#Show result for this record
echo "$record: $count"
done < database.txt
IFS="$ifs_bak"
This takes about 20 seconds to process my 84 mb test file.
zigbert wrote:
2.) Make a fake index based on the new structure including 2000 tracks and 25 timestamps for each track. All together 574kb.
I think I overestimated the size
, such a file takes about 2s on my system using the above script. But I think it could be even faster.
Posted: Thu 23 Aug 2012, 18:32
by zigbert
1.1 sec here
But I think it could be even faster.
Would be good for those with older hardware. - And those are many in the Puppy world.
Thanks a lot (so far)
Sigmund
Posted: Thu 23 Aug 2012, 19:15
by zigbert
akash_rawal
I don't think 5 sec is problematic if user request building a specific rating list....
The speed issue arrives when pMusic wants to give a general rating list of the most popular songs in the Overview mode. A workaround could be to always put the last played track at the bottom of the rating list. When later requesting a quick rating-list we could just 'tail -n 200 > database.txt' before executing played_since_foreach.
That means your code will somehow do the job as is, but if speed improvements is possible, it would sure help.
Sigmund
Posted: Fri 24 Aug 2012, 01:32
by technosaurus
something like:
Code: Select all
LOGENTRY=`grep $filename log.txt`
#grep is faster than while read case on large files
STAMPS=${LOGENTRY#*|}
#you could use bash arrays instead - set allows "arrays" in sh, ash, etc...
set ${STAMPS//,/ }
#for speed change IFS to "," vs. the //,/ } but don't forget to reset it
while ([ $1 -lt $timestamp ]) do
shift #a hacky way to remove all entries < timestamp
done
# $# is the number of args passed or in this case set with "set ..."
echo The number of times played since $timestamp is $#
haven't tested for accuracy, but something along these lines should be pretty fast... even faster using busybox ash, based on past experience
Bash: Count numbers in list
Posted: Fri 24 Aug 2012, 12:27
by L18L
technosaurus wrote:...haven't tested for accuracy, but ...
Hope this will suffice for
accuracy...using zigbert´s original database
# cat database.txt
/root/file1.mp3| 1345641102,1345644563,1345647584
/root/file2.mp3| 1345623045,1345643786,1345648695
#
# ./file_times_played_since 1345644563
The number of times played /root/file1.mp3 since 1345644563 is 2
The number of times played /root/file2.mp3 since 1345644563 is 1
#
# ./file_times_played_since 1345648694
ash: 1345648694: unknown operand
The number of times played /root/file1.mp3 since 1345648694 is 0
The number of times played /root/file2.mp3 since 1345648694 is 1
#
file_times_played_since:
Code: Select all
#!/bin/ash
timestamp=$1
ifs_bak=$IFS
IFS=","
while read LOGENTRY; do
filename=${LOGENTRY%%|*}
#LOGENTRY=`grep $filename database.txt`
#grep is faster than while read case on large files
STAMPS=${LOGENTRY#*|}
#you could use bash arrays instead - set allows "arrays" in sh, ash, etc...
#set ${STAMPS//,/ }
set ${STAMPS}
#for speed change IFS to "," vs. the //,/ } but don't forget to reset it
while ([ $1 -lt $timestamp ]) do
shift #a hacky way to remove all entries < timestamp
done
# $# is the number of args passed or in this case set with "set ..."
echo The number of times played $filename since $timestamp is $#
done < database.txt
IFS=$ifs_bak
The only issue I have found is when result is zero:
ash: 1345648694: unknown operand
Posted: Fri 24 Aug 2012, 14:43
by technosaurus
Oops need to prepend the -lt test with a [ $1 ] && ... For the case where it has not been played in the time period
Posted: Fri 24 Aug 2012, 14:47
by zigbert
This sounds interesting.
In am on the run this weekend, so I can't compare executing-time until Monday.
Sigmund
Posted: Fri 24 Aug 2012, 14:50
by zigbert
technosaurus wrote:Oops need to prepend the -lt test with a [ $1 ] && ... For the case where it has not been played in the time period
....and that means????
Posted: Fri 24 Aug 2012, 16:27
by jamesbond
My submission:
Code: Select all
#!/bin/ash
LANG=C
FIND=3421 # timestamp to find (random)
TIMES=3650 # 3650 timestamps (tracks played everyday for 10 years) - number from 1 to $TIMES
SONGS=2000 # 2000 songs
# generate fake timestamp data
tstamps=$(seq -s, 1 $TIMES)
line="/path/to/song.mp3 | $tstamps"
#echo $line
# generate fake playlist (repeat timestamp data $SONGS times)
for a in $(seq 1 $SONGS); do
echo $line
done |
# do actual work
awk -F, -v TS=$FIND '{
split($1, a, "|"); $1=a[2] # fix first entry
max=NF; min=1;
while (max-min > 1) {
i=int( (max+min)/2 )
if (TS >= $i) min = i
else max = i
}
print a[1], min
}'
With TIMES=3650 and SONGS=2000 as above (meaning, 3650 timestamps per song, with 2000 songs), result:
Code: Select all
# time ./fav.sh > /dev/null
real 0m1.641s
user 0m2.916s
sys 0m0.073s
.
Using Ziggy's original parameter of 25 timestamps per song and 2000 songs (TIMES=25, SONGS=2000), result
Code: Select all
# time ./fav.sh > /dev/null
real 0m0.068s
user 0m0.080s
sys 0m0.010s
Using 25 timestamps for 20,000 songs (TIMES=25, SONGS=20000), result:
Code: Select all
# time ./fav.sh > /dev/null
real 0m0.479s
user 0m0.730s
sys 0m0.057s
Using 3650 timestamps for 20,000 songs (TIMES=3650, SONGS=20000), result:
Code: Select all
# time ./fav.sh > /dev/null
real 0m16.855s
user 0m30.308s
sys 0m0.520s
Of course these are on my machine and is not directly comparable with anyone else's number. Ziggy need to run this on his own machine to test
cheers!
Count numbers in list
Posted: Fri 24 Aug 2012, 17:05
by L18L
zigbert wrote:technosaurus wrote:Oops need to prepend the -lt test with a [ $1 ] && ... For the case where it has not been played in the time period
....and that means????
change
to
Code: Select all
while ( [ $1 ] && [ $1 -lt $timestamp ]) do
tested successfully (
no more: ash: 1345623045: unknown operand ):
# ./file_times_played_since 1345648694
The number of times played /root/file1.mp3 since 1345648694 is 0
The number of times played /root/file2.mp3 since 1345648694 is 1
#
Re: Count numbers in list
Posted: Fri 24 Aug 2012, 17:16
by technosaurus
@jamesbond - over 100 lines grep becomes faster than using builtins, and it doesn't slow it down significantly for less than 100 lines, so its probably worth calling grep in this case since we can't assume a user has/plays only a few songs.
L18L wrote:zigbert wrote:technosaurus wrote:Oops need to prepend the -lt test with a [ $1 ] && ... For the case where it has not been played in the time period
....and that means????
change
to
Code: Select all
while ( [ $1 ] && [ $1 -lt $timestamp ]) do
tested successfully (
no more: ash: 1345623045: unknown operand ):
# ./file_times_played_since 1345648694
The number of times played /root/file1.mp3 since 1345648694 is 0
The number of times played /root/file2.mp3 since 1345648694 is 1
#
thanks, I was/am posting from my phone.