Home > Archive > Fortran > January 2006 > random selection without replacement
You are viewing an archived Text-only version of the thread.
To view this thread in it's original format and/or if you want to reply to
this thread please [click here]
| Author |
random selection without replacement
|
|
|
| HI,
I plan a simulation design as follows:
1. write a main program to do the simulation
2. the main program includes two big subroutine programs
(1) read_dataset.for (This subroutine reads dataset files, including
several array values )
(2) random_selection.for (This subroutine do the random selection
from the dataset read from the subroutine "read_dataset.for")
My purpose is to select randomly from the data set for 5 times and
select without replacement. Therefore, All I think is to have the main
program look like this:
----------------------------------------
call read_dataset
do i=1,5
call random_selection
enddo
------------------------------------
However, I noticed that this call random_selection can only do once
because they don't have the dataset values after the 1st run is
finished. Is this true? Could anyone help me solve this problem?
Thank you very much!
| |
| Dick Russell 2006-01-17, 7:05 pm |
| Without knowing what is in random_selection for code and how it
accesses the data read by read_dataset, it's hard to comment. On the
surface, there is no reason not to assume that read_dataset will read
the data into a MODULE (or COMMON) area, and have random_selection USE
that module in non-destructive fashion to do its work. As long as the
use of the data doesn't write new values on top of the old values, the
whole data set is there for reuse by another call to random_selection.
Are these routines yours or something someone else wrote?
| |
|
| HI,
Thank you very much for trying to help me with such little information.
I am sorry this is a very stupid question. Do you mean that I need to
define the "common used" variables and arries using "common" statement?
I use the same name of the arries in both subroutines and assume they
will
be recognized and pass the value to random_selection. However, I do
want to change the original values (to mark for the select without
replacement). Therefore I change the array values after the first run.
I assume that the array values will be updated and can be used for the
2nd run. This won't work, right?
And yes,these routines are written by me.
| |
| Dick Russell 2006-01-18, 7:58 am |
| If random_selection must have acces to the data read in by
read_dataset, then you must do one of two things:
1. The main can define the arrays to contain the data and pass them as
arguments to read_dataset, to be loaded there with data. Upon return,
the main then can pass these arrays (now containing data) as arguments
to random_selection. If there are many arrays to be loaded, the
argument lists clearly are equally lengthy. This may be cumbersome.
2. The data can reside in arrays that are indeed "common" to any
routine needing access to that data. In the older Fortran 77, a COMMON
statement would be defined to create such a block of data (perhaps with
a name) and name the arrays, their sizes, and their order in the COMMON
area thus defined. This definition (one or more lines, as required),
would appear in the main and in any subroutine needing any of the data
(read_dataset and random_selection). Any changes to the contents of the
arrays subsequently are available to other users of the data. Care must
be taken to ensure the same layout of data arrays within that COMMON in
all routines using the area. The necessary statements defining the
COMMON area could be contained in a separate file, and an INCLUDE
statement naming that file would be placed in each routine referring to
the area. This would avoid having some routine use a different mapping
of arrays within the COMMON area. Use of COMMON areas still works in
the current Fortran standard, but the preferred approach is to define
such areas with MODULE and to gain access to them with USE statements.
MODULEs can provide additional capability; details are in the language
manuals.
Since all the code is yours, you can structure it in any way that suits
your purposes
| |
| Dick Russell 2006-01-18, 7:58 am |
| I should have mentioned that if an array is declared within a routine,
and it is not part of a COMMON area or MODULE (via USE), and it did not
come into the routine as a dummy array in the argument list, then it is
local to that routine. No other routine will have access to that local
array unless the define routine passes it to that other routine in an
argument list. That is why data that must be accessed by multiple
routines has to be in a COMMON or MODULE or be passed around through an
argument list.
| |
| Herman D. Knoble 2006-01-18, 7:58 am |
| Apologies if I misundersdand your request. As I understand
you, you want to shuffle the records. The following is a classic
shuffling algorithm with a sample driver.
Skip
program testrperm
integer, dimension(10) :: q
call rperm (q,10)
print *, q
! Sample output: 3 1 2 6 7 9 4 8 10 5
! There are 10! such permutations.
end program testrperm
subroutine rperm(p,m)
!===generate a random permutation, p, of the first m integers.
! (equivalent to shuffling, sampling without replacement).
! adaptation of knuth volume 2, algorithm 3.4.2p.
integer :: m, p(m), k, j, itemp
real :: u
do j=1,m
p(j)=j
end do
do j=m,1,-1
call random_number(u)
k=int(u*j)+1
itemp=p(j)
p(j)=p(k)
p(k)=itemp
end do
return
end
On 17 Jan 2006 07:10:35 -0800, "PCAT" <pcatchen@gmail.com> wrote:
-|HI,
-|I plan a simulation design as follows:
-|1. write a main program to do the simulation
-|2. the main program includes two big subroutine programs
-| (1) read_dataset.for (This subroutine reads dataset files, including
-|several array values )
-| (2) random_selection.for (This subroutine do the random selection
-|from the dataset read from the subroutine "read_dataset.for")
-|
-|My purpose is to select randomly from the data set for 5 times and
-|select without replacement. Therefore, All I think is to have the main
-|program look like this:
-|----------------------------------------
-|call read_dataset
-|do i=1,5
-| call random_selection
-|enddo
-|------------------------------------
-|However, I noticed that this call random_selection can only do once
-|because they don't have the dataset values after the 1st run is
-|finished. Is this true? Could anyone help me solve this problem?
-|Thank you very much!
| |
|
| HI,
Thank you very much. I think I need to check the MODULE and USE in the
manual.
Before I study that, I tried to use the first method you mentioned. I
found that the 1st run of "call random_selection" is okay. It did
produce output files I needed. However, for the 2nd run, it stucked
over there. Is there anything I missed? I assume that if the program
can run for one time, the values updated during the 1st run of "call
random_selection" should be able to pass to the 2nd run. Is this
correct?
________________________________________
________________________________
call read_dataset(ct,indx,ac,bc,cont,vwz,tg,i
dvwz)
do 21 v=1,r
do 20 w=1,s
do 19 z=1,t !t=3
do j=1,ct(v,w,z)
indx(v,w,z,j)=0
enddo
19 continue
20 continue
21 continue
do 13 i=1,run
call random_selection(indx,ct,vwz,ac,bc,cont,
tg,idvwz)
13 continue
________________________________________
___________________________
| |
| e p chandler 2006-01-18, 7:03 pm |
|
PCAT wrote:
> Before I study that, I tried to use the first method you mentioned. I
> found that the 1st run of "call random_selection" is okay. It did
> produce output files I needed. However, for the 2nd run, it stucked
> over there. Is there anything I missed? I assume that if the program
> can run for one time, the values updated during the 1st run of "call
> random_selection" should be able to pass to the 2nd run. Is this
> correct?
Usually when you run a program over and over again the random number
generator's seed is set to a default initial value. This produces the
same stream of random numbers unless you initialize the generator to
some other value. The advantage of this behavior is that it is easier
to test a modified program for expected results. Suppose you are doing
a simulation and also some matrix algebra. You change the matrix
routines and the answers are markedly different, then what do you do?
-- Elliot
e-mail: epc8 at juno dot com
| |
|
| HI,
I did update the "seed" for my random selection at the end of my
routine: random_selection. Therefore, the 2nd run will have a new seed.
I marked the indx array to indicate being selected or not. Then I
suppose call the random_selection again will know that those with "indx
array=1" will not be able to be selected again. I also wrote those
selected sample to another file as my first output. But I still don't
know why the 2nd run won't work and stucked there.
| |
| e p chandler 2006-01-18, 7:03 pm |
| [I suggested initializing the seed to a new value at the start of each
program run.]
PCAT wrote:
> HI,
> I did update the "seed" for my random selection at the end of my
> routine: random_selection. Therefore, the 2nd run will have a new seed.
> I marked the indx array to indicate being selected or not. Then I
> suppose call the random_selection again will know that those with "indx
> array=1" will not be able to be selected again. I also wrote those
> selected sample to another file as my first output. But I still don't
> know why the 2nd run won't work and stucked there.
I still don't know exactly what you are doing. Maybe I misunderstood
what you meant by "stucked there". Do you mean that the program slows
down or perhaps appears to produce no output while the original program
tries to generate a second set of values?
If you are flagging already selected values and then searching the flag
array for unselected values then the program will slow down as more and
more values are selected and less remain un-selected. [Just a thought.
:-)]
-- Elliot
| |
|
| HI,
Thanks a lot. I think the program is stopped, not slow down. The DOS
window just stucked there and won't show "press any key to continue".
Yes, I flagged those already selected values and then run the routine
again to do the same thing to select more but ingore the already
selected ones. Will it slow down the program? I don't know, but it
appears to me that the program didnot run though in the 2nd run.
| |
| Dr Ivan D. Reid 2006-01-19, 7:05 pm |
| On 19 Jan 2006 01:07:06 -0800, PCAT <pcatchen@gmail.com>
wrote in <1137661626.568440.81970@g14g2000cwa.googlegroups.com>:
> HI,
> Thanks a lot. I think the program is stopped, not slow down. The DOS
> window just stucked there and won't show "press any key to continue".
> Yes, I flagged those already selected values and then run the routine
> again to do the same thing to select more but ingore the already
> selected ones. Will it slow down the program? I don't know, but it
> appears to me that the program didnot run though in the 2nd run.
If you mean what I think you do, that is a very slow way to do
the selection -- and probably prone to errors that could lead to an infinite
loop.
Consider, if you have a population of 100. After selecting and
marking the first member, the second selection has 1 chance in 100 of
picking that member again and needing a second call to the random-number
generator. By the time you have selected 50 members there is an even
chance that your first pick will already have been selected and you'll
have to choose again, then a 50% chance that you need a third choice...
If you continue right down to choosing the 100th, unless you realise there's
only one choice yet, you'd need, on average, 100 choices before arriving
at the final one. Far better to move your choices out of the population and
adjust your selection to suit just the remaining population.
--
Ivan Reid, Electronic & Computer Engineering, ___ CMS Collaboration,
Brunel University. Ivan.Reid@[brunel.ac.uk|cern.ch] Room 40-1-B12, CERN
KotPT -- "for stupidity above and beyond the call of duty".
| |
|
| HI,
Thank you very much for your suggestion. I think that reomve the record
will definately speed up the process. My program is written based on
the following :
I have several subgroups within the population. Therefore, I have a
prespecified number of samples chosen from each subgroup.
My way to do the sampling is for example, categorized the 12 groups
into 2 dimension: 4*3=12 groups. Each subgroup has different numbers of
elements.
For example, subgroup(1,1) has total 13 elements. and subgroup(1,2) has
5 elements.
If my purpose is to randomly select 3 elements from subgroup (1,1) and
1 elements from subgroup(1,2). My way of doing this is to specify an
index array of "indx(4,3,n)" and initialize them to be zero. Then call
the digit random routine between (1,13) for subgroup (1,1) for 3 times
to select the sample. If the ith element in subgroup(1,1) is selected,
then indx(1,1,i)=1. Similarly, I will call random digit routine between
(1,5) for one time to select the sample from subgroup (1,2).If the jth
element in subgroup(1,2) is selected, then indx(1,2, j)=1.
I know this way will waste the time of calling the previously selected
random number. However, that is all I know to deal with this problem. I
don't know how to "remove" the record from the whole population and
still can do the random selection within each subgroup. Of course there
are some more complicated algorithms inside my routines, but I think
this is the main part of how I deal with the random selection. I know
this is probably a very simple problem. Please forgive me with no CS
background or training and has to do the programming. If possible,
could you suggest me an algorithm to deal with this kind of problem
more efficient? Thank you very much.
| |
| Ron Shepard 2006-01-19, 7:05 pm |
| In article <1137661626.568440.81970@g14g2000cwa.googlegroups.com>,
"PCAT" <pcatchen@gmail.com> wrote:
> HI,
> Thanks a lot. I think the program is stopped, not slow down. The DOS
> window just stucked there and won't show "press any key to continue".
> Yes, I flagged those already selected values and then run the routine
> again to do the same thing to select more but ingore the already
> selected ones. Will it slow down the program? I don't know, but it
> appears to me that the program didnot run though in the 2nd run.
If you want random selection from an array without replacement, then
isn't that the same thing as setting an integer index array with a
random permutation once, and just going through that index array a
block at a time with each pass? This way you don't have to keep
checking the elements to see if they have been selected previously,
and the last pass is just as efficient as the first pass.
$.02 -Ron Shepard
| |
|
| HI,
Thank you very much for your suggestion. I think that reomve the record
will definately speed up the process. My program is written based on
the following :
I have several subgroups within the population. Therefore, I have a
prespecified number of samples chosen from each subgroup.
My way to do the sampling is for example, categorized the 12 groups
into 2 dimension: 4*3=12 groups. Each subgroup has different numbers of
elements.
For example, subgroup(1,1) has total 13 elements. and subgroup(1,2) has
5 elements.
If my purpose is to randomly select 3 elements from subgroup (1,1) and
1 elements from subgroup(1,2). My way of doing this is to specify an
index array of "indx(4,3,n)" and initialize them to be zero. Then call
the digit random routine between (1,13) for subgroup (1,1) for 3 times
to select the sample. If the ith element in subgroup(1,1) is selected,
then indx(1,1,i)=1. Similarly, I will call random digit routine between
(1,5) for one time to select the sample from subgroup (1,2).If the jth
element in subgroup(1,2) is selected, then indx(1,2, j)=1.
I know this way will waste the time of calling the previously selected
random number. However, that is all I know to deal with this problem. I
don't know how to "remove" the record from the whole population and
still can do the random selection within each subgroup. Of course there
are some more complicated algorithms inside my routines, but I think
this is the main part of how I deal with the random selection. I know
this is probably a very simple problem. Please forgive me with no CS
background or training and has to do the programming. If possible,
could you suggest me an algorithm to deal with this kind of problem
more efficient? Thank you very much.
| |
| e p chandler 2006-01-19, 7:05 pm |
|
PCAT wrote:
> HI,
> Thank you very much for your suggestion. I think that reomve the record
> will definately speed up the process. My program is written based on
> the following :
> I have several subgroups within the population. Therefore, I have a
> prespecified number of samples chosen from each subgroup.
> My way to do the sampling is for example, categorized the 12 groups
> into 2 dimension: 4*3=12 groups. Each subgroup has different numbers of
> elements.
> For example, subgroup(1,1) has total 13 elements. and subgroup(1,2) has
> 5 elements.
> If my purpose is to randomly select 3 elements from subgroup (1,1) and
> 1 elements from subgroup(1,2). My way of doing this is to specify an
> index array of "indx(4,3,n)" and initialize them to be zero. Then call
> the digit random routine between (1,13) for subgroup (1,1) for 3 times
> to select the sample. If the ith element in subgroup(1,1) is selected,
> then indx(1,1,i)=1. Similarly, I will call random digit routine between
> (1,5) for one time to select the sample from subgroup (1,2).If the jth
> element in subgroup(1,2) is selected, then indx(1,2, j)=1.
>
> I know this way will waste the time of calling the previously selected
> random number. However, that is all I know to deal with this problem. I
> don't know how to "remove" the record from the whole population and
> still can do the random selection within each subgroup. Of course there
> are some more complicated algorithms inside my routines, but I think
> this is the main part of how I deal with the random selection. I know
> this is probably a very simple problem. Please forgive me with no CS
> background or training and has to do the programming. If possible,
> could you suggest me an algorithm to deal with this kind of problem
> more efficient? Thank you very much.
Here is some pseudo code:
determine the size of the largest possible subgroup, say K
dimension an array to size K or a number you know is larger
for each subgroup
the subgroup has N members, you want to select M w/o replacement
fill the array with with integers 1 to N
shuffle these integers in the array
choose the first M values in the array to select members of the
subgroup
go on to the next subgroup
** aside **
This looks like a fairly standard statistical sampling problem. Does
your institution have any statisticians on board? Also I believe that
standard statistical packages such as SAS have routines that let you
pick various types of random samples. If you are going to use such a
package for data analysis why not use it for sampling?
** end aside **
| |
|
| HI,
You are right. That seems like the same thing and will reduce the
searching time. I have another 3rd dimension constraint needs to be
considered. For example, each element of in the population has three
dimension. The first two are described in my previouse post above. Then
there is a 3rd dimension of each element, and the sum of each category
of the 3rd dimension should be satisfied.
I need to think it thourgh how to apply this approach in my algorithm.
Thank you very much!
| |
| e p chandler 2006-01-19, 7:05 pm |
| [question about multi-factor sampling w/o replacement]
PCAT wrote:
> HI,
> You are right. That seems like the same thing and will reduce the
> searching time. I have another 3rd dimension constraint needs to be
> considered. For example, each element of in the population has three
> dimension. The first two are described in my previouse post above. Then
> there is a 3rd dimension of each element, and the sum of each category
> of the 3rd dimension should be satisfied.
> I need to think it thourgh how to apply this approach in my algorithm.
> Thank you very much!
I am still unsure what you mean, but let's suppose that you have three
attributes with P,Q & R classes in each. You want to sample a subset of
the sample space w/o replacement.
Then all you need to do is select M of N = P * Q * R w/o replacement.
For each draw you use a suitable sequence of INT() and MOD() operations
to map integers 1..N to triplets (1..P,1..Q,1..R).
Does this help?
-- Elliot
| |
|
| HI,
Let me briefly describe a simpler version of what I want to do:
Each element in the sample space has three attributes: a,b,c. There are
4 categories in "a", 6 categories in "b" and 3 categories in "c"
(c1,c2,c3).
Then the sample space is divided into 4*6 subgroups.
I have a pre-specified no. of required elements from each subgroup.
Assume the final sample size is 30 after adding each selected elements
from each cell. I still need to make sure that the composition of the
sample for the attribute c is 5~8 from c1, 10~15 from c2, and 8~12 from
c3.
My purpose, for example, I want to select 5 *30 elements without
replacement and also satisfy the (c1,c2,c3)=(5~8,10~15,8~10) and the
no. of elements from each subgroup.
Therefore, my way of doing this is to have a part of satisfying the no.
of elements from each subgroup. At the same time when I select the
elements, the sum of c1, c2, c3 are also calculated. Therefore, I think
maybe mapping 1~N(no. of total elements) to (1...4, 1...6,1...3) cannot
work for my case.
I still hope that it works if I use a main program to do the simulation
and call the random_selection routine several times (with the mark of
selection and also many other constraints while selecting). This way, I
don' t need to change the whole program too much. Can it work?
| |
| e p chandler 2006-01-19, 7:05 pm |
| PCAT wrote:
> HI,
> Let me briefly describe a simpler version of what I want to do:
> Each element in the sample space has three attributes: a,b,c. There are
> 4 categories in "a", 6 categories in "b" and 3 categories in "c"
> (c1,c2,c3).
> Then the sample space is divided into 4*6 subgroups.
> I have a pre-specified no. of required elements from each subgroup.
> Assume the final sample size is 30 after adding each selected elements
> from each cell. I still need to make sure that the composition of the
> sample for the attribute c is 5~8 from c1, 10~15 from c2, and 8~12 from
> c3.
> My purpose, for example, I want to select 5 *30 elements without
> replacement and also satisfy the (c1,c2,c3)=(5~8,10~15,8~10) and the
> no. of elements from each subgroup.
> Therefore, my way of doing this is to have a part of satisfying the no.
> of elements from each subgroup. At the same time when I select the
> elements, the sum of c1, c2, c3 are also calculated. Therefore, I think
> maybe mapping 1~N(no. of total elements) to (1...4, 1...6,1...3) cannot
> work for my case.
>
> I still hope that it works if I use a main program to do the simulation
> and call the random_selection routine several times (with the mark of
> selection and also many other constraints while selecting). This way, I
> don' t need to change the whole program too much. Can it work?
Sorry but you've lost me at this point.
1. It looks like you want a weighted sample from c1,c2,c3. But I don't
understand how you reconcile this with "sample without replacement".
2. This is turning into a conversation more than a newsgroup posting.
So I suggest we take this off list. See my e-mail address in a previous
posting.
-- Elliot
| |
|
| To all people who suggested me every way of fixing my program:
I've finally rechecked all the codes and found a very stupid mistake
when I initialize one array.
Now my program can run the subroutine several times wtihout problems!
Thank you very much !!
| |
|
| To all people who suggested me every way of fixing my program:
I've finally rechecked all the codes and found a very stupid mistake
when I initialize one array.
Now my program can run the subroutine several times wtihout problems!
Thank you very much !!
|
|
|
|
|